Abstract

This  thesis  explores  using  busses  in  communication  architectures  and  control  structures.  First, 
we  investigate  the  organization  of  permutation  architectures  with  bussed  interconnections.  We 
explore  how  to  efficiently  permute  data  among  VLSI  chips  in  accordance  with  a  predetermined  set 
of  permutations.  By  connecting  chips  with  shared  bus  interconnections,  as  opposed  to  point-to- 
point  interconnections,  we  show  that  the  number  of  pins  per  chip  can  often  be  reduced.  The  results 
are  derived  from  a  mathematical  characterization  of  uniform  permutation  architectures  based  on 
the  combinatorial  notion  of  a  difference  cover.  Second,  we  explore  priority  arbitration  schemes  that 
use busses to arbitrate among n modules. We investigate schemes that use $\lg n \le m \le n$ busses
and asynchronous combinational arbitration logic. The standard binary arbitration scheme uses
$m = \lg n$ busses and arbitrates in $t = \lg n$ time. We present the binomial arbitration scheme that
uses $m = \lg n + 1$ busses and arbitrates in $t = \frac{1}{2} \lg n$ time. We generalize binomial arbitration to
achieve a bus-time tradeoff $m = \Theta(t n^{1/t})$. The new schemes are based on data-dependent analysis
and can be adopted with no changes to existing protocols. Third, we examine the performance of
binary arbitration in a digital transmission line bus model. We show that arbitration time depends
on the arrangement of modules. For general arrangements, arbitration time grows linearly with
the number of busses, while for linear arrangements, arbitration time is constant.
Abstract

This  thesis  investigates  several  aspects  of  the  organization  of  digital  systems  that  employ  bussed 
interconnections.  The  thesis  focuses  on  two  application  domains  for  busses:  communication 
architectures  and  control  mechanisms,  and  explores  the  capabilities  of  busses  as  interconnection 
media,  computation  devices,  and  transmission  channels. 

Chapter  1  discusses  the  significance  of  bussed  interconnect  in  digital  systems,  provides  some 
background  on  busses,  and  describes  the  problems  addressed  in  this  thesis. 

In  Chapter  2  we  investigate  the  organization  of  permutation  architectures  that  employ 
bussed  interconnections.  We  explore  the  problem  of  efficiently  permuting  data  stored  in  VLSI 
chips  in  accordance  with  a  predetermined  set  of  permutations.  By  connecting  chips  with  shared 
bus  interconnections,  as  opposed  to  point-to-point  interconnections,  we  show  that  the  number 
of  pins  per  chip  can  often  be  reduced.  For  example,  we  exhibit  permutation  architectures  with 
$\lceil \sqrt{n} \rceil$ pins per chip that can realize any of the n cyclic shifts on n chips in one clock tick.
When the set of permutations forms a group with p elements, any permutation in the group
can be realized in one clock tick by an architecture with $O(\sqrt{p \lg p})$ pins per chip. When
the permutation group is abelian, we show that $O(\sqrt{p})$ pins suffice. These results are all
derived  from  a  mathematical  characterization  of  uniform  permutation  architectures  based  on  the 
combinatorial  notion  of  a  difference  cover.  We  also  consider  uniform  permutation  architectures 
that  realize  permutations  in  several  clock  ticks,  instead  of  one,  and  show  that  further  savings 
in  the  number  of  pins  per  chip  can  be  obtained. 

Chapter  3  explores  efficient  utilization  of  busses  for  implementing  arbitration  mechanisms. 
We  investigate  priority  arbitration  schemes  that  use  busses  to  arbitrate  among  n  modules  in  a 
digital system. We focus on distributed mechanisms that employ m busses, for $\lg n \le m \le n$,
and use asynchronous combinational arbitration logic. A widely used distributed asynchronous
mechanism is the binary arbitration scheme, which with $m = \lg n$ busses arbitrates in $t = \lg n$
units of bus-settling time. We present a new asynchronous scheme, binomial arbitration,
that by using $m = \lg n + 1$ busses reduces the arbitration time to $t = \frac{1}{2} \lg n$. Extending this
result, we present the generalized binomial arbitration scheme that achieves a bus-time tradeoff
of the form $m = \Theta(t n^{1/t})$ between the number of arbitration busses m and the arbitration
time t (in units of bus-settling time), for values of $1 \le t \le \lg n$ and $\lg n \le m \le n$. Our schemes
are based on a novel analysis of data-dependent delays. Most importantly, our schemes can be
adopted  with  no  changes  to  existing  hardware  and  protocols;  they  merely  involve  selecting  a 
good  set  of  priority  arbitration  codewords. 

In Chapter 4, we examine the performance of priority arbitration schemes presented in
Chapter  3  under  the  digital  transmission  line  bus  model.  This  bus  model  accounts  for  the 
propagation  time  of  signals  along  bus  lines  and  assumes  that  the  propagating  signals  are  always 
valid  digital  signals.  A  widely  held  misconception  is  that  in  the  digital  transmission  line  model 
the  arbitration  time  of  the  binary  arbitration  scheme  is  at  most  4  units  of  bus-propagation  delay. 
We  formally  disprove  this  conjecture  by  demonstrating  that  the  arbitration  time  of  the  binary 
arbitration  scheme  is  heavily  dependent  on  the  arrangement  of  the  arbitrating  modules  in  the 
system.  We  provide  a  general  scenario  of  module  arrangement  on  m  busses,  for  which  binary 
arbitration  takes  at  least  m/2  units  of  bus-propagation  delay  to  stabilize.  We  also  prove  that 
for  general  arrangements  of  modules  on  m  busses,  binary  arbitration  settles  in  at  most  m/2  +  2 
units  of  bus-propagation  delay,  while  binomial  arbitration  settles  in  at  most  m/4  +  2  units  of 
bus-propagation  delay,  thereby  demonstrating  the  superiority  of  binomial  arbitration  for  general 
arrangements  of  modules  under  the  digital  transmission  line  model.  For  linear  arrangements  of 
modules in increasing order of priority, with equal spacing between modules, we show that 3
units  of  bus-propagation  delay  are  necessary  for  binary  arbitration  to  settle,  and  we  sketch  an 
argument  that  3  units  of  bus-propagation  delay  are  also  asymptotically  sufficient. 

Finally,  Chapter  5  provides  some  concluding  remarks  and  identifies  directions  for  further 
research  on  systems  with  bussed  interconnections. 

Keywords:  arbitration,  arbitration  protocol,  asynchronous  arbitration,  binary  arbitration, 
binomial arbitration, bus-propagation time, bus-settling time, bus-time tradeoff, bussed
interconnections, busses, cyclic shifter, data-dependent delays, difference cover, digital transmission
line,  generalized  binomial  arbitration,  linear  arbitration,  permutation  architecture,  permutation 
set,  priority  arbitration,  signal  propagation,  uniform  architecture,  VLSI. 
Introduction


This thesis investigates several aspects of the organization of systems with bussed interconnections.
Busses are used in many electronic and computer systems for a variety of applications,
including  broadcasting  information,  realizing  communication  patterns,  implementing  system 
primitives,  and  performing  computations.  Busses  come  in  all  shapes  and  sizes  and  connect 
modules  at  various  system  levels.  Busses  are  the  backbone  of  many  digital  systems  and  play  a 
vital  role  in  numerous  architectures. 

Busses  are  desirable  in  many  systems  due  to  their  simplicity,  modularity,  reliability,  and 
monitoring  capabilities.  Busses  constitute  shared  media  to  which  connected  modules  can  listen 
and  onto  which  they  can  broadcast.  Busses  offer  scalable-cost  interconnect,  standard  module 
interface,  and  configuration  flexibility.  Bussed  organizations  are  easy  to  control  and  monitor, 
and  provide  a  high  level  of  reliability  at  moderate  cost. 

Busses  have  been  extensively  researched  in  the  electrical  engineering  and  computer  science 
literature  (see  references).  Various  aspects  of  busses  have  been  investigated,  including  the 
physical  and  electrical  characteristics  of  the  media,  interconnection  topologies,  communication 
protocols,  and  algorithmic  techniques,  among  others.  Bussed  interconnections  are  still  not  fully 
understood,  however,  and  their  capabilities  are  not  fully  exploited.  Due  to  the  widespread  use 
of  busses  for  applications  in  electronic  and  computer  systems,  it  is  important  to  develop  a  better 
understanding  of  the  organization  and  capabilities  of  systems  with  bussed  interconnections.  In 
this  thesis,  we  investigate  several  organizational  aspects  of  digital  systems  that  employ  bussed 
interconnections  and  demonstrate  how  to  use  busses  more  efficiently  for  implementing  several 
system functions. Although the results of this thesis are presented with computer systems and
computer  busses  in  mind,  they  are  not  limited  to  these  settings  and  are  applicable  to  general 
systems  that  employ  communication  over  shared  media. 

This  thesis  is  organized  as  follows.  In  this  chapter,  we  discuss  several  issues  of  bussed 
interconnections  that  are  relevant  to  our  work  and  describe  the  problems  addressed  in  this 
thesis.  The  body  of  the  thesis  focuses  on  two  application  domains  for  shared  interconnect, 
communication  architectures  and  control  mechanisms,  and  examines  the  capabilities  of  busses 
as  interconnection  media,  computation  devices,  and  transmission  channels.  In  Chapter  2,  we 
investigate  the  organization  of  permutation  architectures  that  employ  bussed  interconnections. 
Chapter  3  explores  how  to  implement  priority  arbitration  mechanisms  efficiently  on  busses 
that  exhibit  fixed  settling  delay.  In  Chapter  4,  we  examine  the  performance  of  some  priority 
arbitration  schemes  under  the  digital  transmission  line  model.  Finally,  Chapter  5  presents 
some  concluding  remarks  and  directions  for  further  research  concerning  systems  with  bussed 
interconnections. 


1.1  Bussed  interconnections 

Busses  are  shared  communication  media.  Many  digital  systems  employ  one  or  more  busses 
to communicate among system modules. Busses enable several devices sharing the same
interconnection medium to communicate, in contrast with point-to-point wires that establish
communication  only  between  pairs  of  devices. 

Several  technologies  of  shared  interconnect  can  be  classified  as  busses,  including  broadcast 
radio  channels,  electrical  wires,  and  optical  fibers.  The  focus  of  this  thesis  is  on  electrical 
busses,  which  are  used  by  most  computer  systems.  Extensive  surveys  and  tutorials  on  the 
characteristics  of  electrical  busses  appear  in  [16,  22,  40,  57,  82,  88].  Discussion  of  other  shared 
communication  media  can  be  found,  for  example,  in  [12,  61,  78].  In  this  section,  we  briefly 
introduce  and  discuss  several  issues  of  electrical  busses  that  are  important  for  the  development 
of  this  thesis  and  we  comment  on  their  relevance.  We  present  these  issues  in  a  somewhat 
bottom-up  manner. 

Bus  driving  technologies.  There  are  several  standard  technologies  for  driving  digital 
signals  onto  an  electrical  bus.  One  common  bus-driving  technology  is  the  tri-state  driver,  where 
a device driver either applies a logic level of 0, applies a logic level of 1, or disables its output terminal
and  leaves  it  floating  (see  [22,  62,  88]).  Tri-state  drivers  consume  little  power,  but  can  only 
be  used  when  it  is  guaranteed  that  at  all  times  no  more  than  one  device  drives  the  bus,  while 
all  other  devices  disable  their  drivers.  This  requirement  must  be  met,  since  otherwise  devices 
may fight each other, resulting in high-current spikes, intermediate voltage levels on the bus,
and  possible  component  failure.  Another  common  bus-driving  technology  is  the  open-collector 
driver,  where  an  external  pullup  drives  the  bus  to  a  default  logic  level  and  device  drivers  can 
pull  the  bus  down  to  express  the  nondefault  logic  value  (see  [22,  40,  88]).  The  open-collector 
technology  allows  the  bus  to  implement  a  wired-OR  logic  function,  since  several  devices  can 
pull the bus down simultaneously, resulting in the OR of the logic values applied. (Another
technology for implementing wired-OR is to charge and discharge a VLSI bus line that is treated
as a large capacitor (see [62, 83]).) In this thesis, we explore both tri-state and open-collector
drivers.  The  results  of  Chapter  2  can  use  either  tri-state  or  open-collector  busses,  while  Chapters 
3  and  4  make  use  of  open-collector  busses. 

Bus  signal  propagation.  A  bus,  being  a  physical  element,  has  several  physical  and 
electrical  characteristics.  The  propagation  of  a  signal  on  a  bus  takes  time,  which  depends 
on  the  length,  material,  shape,  temperature,  and  other  physical  properties  of  the  bus  and  its 
environment.  A  high-speed  bus  is  modeled  as  an  analog  transmission  line  with  associated 
impedance that depends on the inductance, the capacitance, and the length of the bus (see [5, 40]).
Most  computer  systems,  however,  use  the  digital  abstraction,  which  specifies  certain  discrete 
voltage  levels  for  representing  logic  values.  Digital  signals  driven  onto  a  bus  require  time  to 
propagate  and  to  resolve  various  transient  effects  before  the  bus  reaches  a  valid  logic  level. 
In  designing  digital  bus  primitives  and  protocols,  careful  attention  must  be  given  to  modeling 
the  bus  appropriately  and  to  allowing  enough  time  for  the  bus  to  settle  before  the  logic  value 
that  it  carries  can  be  reliably  used.  In  this  thesis  we  use  the  digital  abstraction  of  busses. 
In  Chapter  2,  busses  are  used  as  interconnection  media  and  we  assume  that  sufficient  time  is 
allocated  for  signal  propagation  along  a  bus.  In  Chapters  3  and  4,  busses  may  be  driven  by 
multiple  modules  and  may  carry  transient  signals.  Chapter  3  assumes  that  the  bus-settling  time, 
denoted by $T_{\mathrm{bus}}$, is accounted for, while Chapter 4 analyzes the effects of signal propagation
along idealized digital transmission lines with bus-propagation time of $T_p$.

Number and functionality of bus lines. Bussed systems vary considerably in the
number  of  bus  lines  they  use  and  in  their  functionality.  A  single  bus  line  can  only  implement 
one  communication  transaction  at  any  given  time  and  its  performance,  therefore,  degrades 
when  the  number  of  modules  connected  to  it  increases;  the  latency  of  a  bus  with  n  modules 
is $O(n)$ and its throughput is $O(1/n)$. However, many bussed systems use a single bus line
for  serial  communication  when  the  cost  associated  with  multiple  lines  is  too  high  or  when  the 
functionality  of  the  bus  does  not  justify  multiple  lines  (see  [16,  22,  61,  88]).  Most  backplane  bus 
systems,  on  the  other  hand,  use  a  collection  of  bus  lines  to  provide  high  bandwidth  connections 
between  system  modules  (see  [16,  22,  40]).  Such  systems  use  parallel  communication  to  transfer 
several  bits  concurrently,  thereby  reducing  the  time  that  the  bus  system  is  occupied  by  any 
given  transaction.  In  addition,  several  multiplexing  techniques  enable  multiple  transactions 
over  the  same  collection  of  bus  lines  by  using  time  sharing  or  frequency  sharing  of  the  busses. 
Another  common  method  for  enhancing  system  connectivity  and  performance  is  the  use  of 
multiple  busses  to  establish  concurrent  and  independent  communication  channels  among  system 
modules  or  subsets  of  them  (see  [10,  13,  30,  54,  64,  69,  70,  73,  77]).  In  this  thesis,  we  focus 
on  multiple  and  parallel  bus  lines.  Chapter  2  uses  multiple  busses  to  establish  concurrent  and 
independent  communication  channels  among  subsets  of  modules  and  Chapters  3  and  4  explore 
how  to  efficiently  employ  parallel  bus  lines  that  are  shared  among  all  system  modules. 

Bus  timing  disciplines.  To  control  the  behavior  of  a  complex  digital  system,  one  of 
several  timing  disciplines  is  used  (see  [22,  62,  88]).  There  are  two  orthogonal  dimensions  to 
distinguish  between  timing  disciplines:  synchronous  vs.  asynchronous  and  global  vs.  local.  In 
a synchronous system, there is a systemwide notion of time, generally established by using
systemwide clock signals, that is used for timing and coordinating transactions. Bus transactions,
in  a  synchronous  system,  start  at  some  clock  edge  and  finish  at  a  subsequent  clock  edge,  taking 
an  integral  multiple  of  clock  cycles  to  complete.  An  asynchronous  system,  in  contrast,  does  not 
time  operations  but  rather  coordinates  them  through  the  use  of  hand-shaking  protocols.  Bus 
transactions,  in  an  asynchronous  system,  can  start  and  finish  at  any  time  and  their  duration 
is self-determined. In globally timed systems each operation takes a fixed and predetermined
amount  of  time,  while  in  locally  timed  systems  modules  can  control  the  duration  of  different 
operations  by  using  several  control  signals.  These  two  orthogonal  dimensions  of  classifying 
timing disciplines give rise to four general classes of timing disciplines: Synchronous Globally
Timed  (SGT),  Asynchronous  Globally  Timed  (AGT),  Synchronous  Locally  Timed  (SLT),  and 
Asynchronous  Locally  Timed  (ALT).  The  choice  between  these  timing  disciplines  depends  on 
the  purpose,  performance,  and  cost  of  the  designed  system.  In  this  thesis,  we  focus  on  the  SGT, 
AGT,  and  ALT  timing  disciplines.  The  architectures  of  Chapter  2  use  the  synchronous  globally 
timed  discipline,  while  Chapters  3  and  4  explore  asynchronous  globally  timed  and  asynchronous 
locally  timed  mechanisms. 

Bus arbitration and mastership. Since a bus is shared among several system modules,
situations may arise where the bus is simultaneously requested by more than one module.
To  allocate  the  bus  to  one  module  at  a  time,  an  arbitration/access  mechanism  is  required 
that  determines  the  mastership  of  the  bus.  Numerous  arbitration/access  mechanisms  have 
been  developed,  including  daisy  chains,  priority  circuits,  polling,  token  passing,  and  carrier 
sense multiple access protocols (see [12, 16, 22, 40, 57, 61, 78, 82, 88]). A distinction is often
made between centralized arbitration/access mechanisms, where bus arbitration and access
are determined by a central controller, and distributed arbitration/access mechanisms, where
arbitration and access processes are carried out simultaneously by all system modules.
Centralized controllers are generally simpler, operate faster, and are more flexible in their assignment
procedures. Distributed controllers, on the other hand, are usually more reliable, require less
dedicated wiring and communication, and are easier to monitor and expand. Many tightly
coupled systems, such as SIMD parallel machines and high-performance architectures, use
central control mechanisms, while more loosely coupled systems, such as multiprocessor systems
and data communication networks, employ distributed arbitration/access mechanisms. In this
thesis, both centralized and distributed control mechanisms are explored. The permutation
architectures described in Chapter 2 use a centralized bus mastership procedure, while Chapters
3  and  4  investigate  distributed  arbitration  mechanisms  with  busses. 
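
To make the distributed case concrete, the following is a minimal sketch of priority arbitration over wired-OR lines in the style usually described for binary arbitration: every module applies its priority codeword, watches the lines, and withdraws its lower-order bits whenever it sees a 1 on a line where its own bit is 0. The function names, the integer encoding of codewords, and the round-based update (each round standing in for one bus-settling delay) are assumptions made for illustration; they are not taken from the thesis, whose analysis is asynchronous.

```python
def wired_or(words):
    """Wired-OR of the words applied to the lines (bit i models bus line i)."""
    bus = 0
    for w in words:
        bus |= w
    return bus


def drive(code, bus, m):
    """Bits a module applies after observing `bus` on m lines.

    If some line carries 1 where the module's own bit is 0, the module
    withdraws that bit position and everything below it (it is losing there).
    """
    for i in range(m - 1, -1, -1):          # scan from the most significant line
        bit = 1 << i
        if (bus & bit) and not (code & bit):
            return code & ~((bit << 1) - 1)  # keep only the bits above line i
    return code                              # not losing anywhere: apply all bits


def binary_arbitration(codes, m):
    """Iterate rounds until the lines stop changing.

    Returns (settled bus value, number of rounds in which the bus changed).
    Assumes synchronous rounds, which is a simplification of the real timing.
    """
    bus, rounds = 0, 0
    while True:
        new_bus = wired_or(drive(c, bus, m) for c in codes)
        if new_bus == bus:
            return bus, rounds
        bus, rounds = new_bus, rounds + 1


if __name__ == "__main__":
    winner, rounds = binary_arbitration([0b1010, 0b0111, 0b0011], m=4)
    print(f"settled value {winner:04b} after {rounds} changing rounds")
```

Under these assumptions the lines settle to the largest competing codeword, and the winning module is the one whose codeword matches the settled value. The number of rounds depends on the particular codewords competing, which is the kind of data dependence that Chapter 3 analyzes.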

Bus  transactions.  Busses  can  be  used  to  implement  several  types  of  communication 
transactions  that  can  be  characterized  by  the  sets  of  modules  involved.  The  most  common 
types  of  bus  transactions  are  one-to-one,  where  a  single  module  transmits  data  intended  for 
a  single  receiver,  and  one-to-many  (broadcast),  where  a  single  module  sends  information  to 
multiple receivers. The receiver or receivers of a bus transaction are typically identified by their
address or through external control. Two other types of transactions, which are less frequently
implemented on busses, are the many-to-one (converge) and many-to-many (multicast)
communication patterns. In these transactions, several modules may try to transmit information
concurrently over the same medium, which requires some means of combining or selecting among the
different  requests.  This  thesis  investigates  some  of  these  bus  transactions.  Chapter  2  deals  with 
realizing  permutations  (one-to-one  transactions)  over  bussed  interconnections,  while  Chapters 
3  and  4  use  broadcast  (one-to-many  transactions)  and  multicast  (many-to-many  transactions) 
over  wired-OR  busses. 


1.2  Focus  and  contribution  of  this  thesis 

Bussed  interconnections  are  used  for  many  applications  in  electronic  and  computer  systems. 
This  thesis  focuses  on  two  application  domains  for  busses:  communication  architectures  and 
control mechanisms, and examines the capabilities of busses as interconnection media,
computation devices, and transmission channels. The following subsections describe the contribution
of  the  thesis  chapters  and  put  the  results  of  this  thesis  in  perspective. 

1.2.1  Communication  architectures 

The  interconnection  network  of  a  digital  system,  which  connects  the  system  modules  to  each 
other,  has  a  profound  impact  on  the  system’s  capabilities,  performance,  size,  and  cost.  Several 
interconnection  schemes  have  been  heavily  studied  and  are  used  in  many  systems,  including 
point-to-point  wires,  multistage  interconnection  networks,  and  shared  busses.  Because  of  the 
costs  associated  with  wiring  and  packaging,  it  is  generally  desirable  to  minimize  the  number  of 
wires  in  a  system  and  the  number  of  connections  per  module. 

Chapter 2 of this thesis investigates how busses (multiple-pin wires) can be employed to
efficiently  realize  certain  communication  patterns  among  modules  in  a  digital  system.  We 
concentrate  on  the  problem  of  efficiently  permuting  data  stored  in  VLSI  chips  (modules)  in 
accordance  with  a  predetermined  set  of  permutations.  We  show  that  by  connecting  modules 
with  shared  bus  interconnections,  as  opposed  to  point-to-point  interconnections,  the  number  of 
pins  per  module  can  often  be  significantly  reduced. 

Much research has focused on implementing permutations and various other communication
patterns  on  different  interconnection  networks.  By  using  point-to-point  wires,  for  example,  any 
communication  pattern  can  be  realized  in  one  communication  cycle.  For  rich  and  diverse 
communication  patterns,  however,  full  point-to-point  interconnections  tend  to  use  many  wires 
and  many  connections  per  module,  since  any  two  modules  that  need  to  communicate  must 
share  a  wire.  (See  [60,  83]  for  VLSI  costs  of  point-to-point  interconnection  schemes.)  Multistage 
interconnection  networks  have  also  been  heavily  investigated  for  the  purpose  of  realizing  general 
communication patterns and more specifically for routing permutations (see [6, 7, 27, 32, 37,
52, 53, 55, 74, 75, 86]). Many multistage interconnection networks exhibit a logarithmic number
of stages and a constant number of connections per module. However, the savings in the number
of pins per module come at the expense of realizing permutations in a logarithmic number of
communication cycles and the use of a considerable amount of switching hardware. The use of
busses as the interconnection infrastructure for realizing communication patterns has also been
examined by several researchers (see [10, 13, 30, 64, 73, 77]). In this thesis we demonstrate that
bussed interconnections can be employed for realizing general classes of permutations in one
communication cycle, with a considerably smaller number of pins per module, and with virtually
no  switching  and  controlling  hardware. 

In  Chapter  2,  we  exhibit  bussed  permutation  architectures  for  many  classes  of  permutation 
sets. For example, we present permutation architectures that with $O(\sqrt{n})$ pins per module can
realize any of the n cyclic shifts on n modules in one communication cycle. Our results are
derived from a mathematical characterization of uniform permutation architectures based on the
combinatorial notion of a difference cover. We extend our discussion to permutation groups and
show that when the set of permutations forms a group with p elements, any permutation in the
group can be realized in one communication cycle by a uniform architecture with $O(\sqrt{p \lg p})$ pins
per module. Furthermore, when the permutation group is abelian, we show that $O(\sqrt{p})$ pins per
module  suffice.  We  also  consider  uniform  permutation  architectures  that  realize  permutations 
in  several  communication  cycles,  instead  of  one,  and  show  that  further  savings  in  the  number 
of  pins  per  module  can  be  obtained.  Finally,  we  identify  many  permutation  networks  that  can 
benefit  from  our  methodology  of  using  difference  covers  for  designing  uniform  architectures, 
including  hypercubes,  multidimensional  meshes,  and  shuffle-exchange  networks. 
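
The notion of a difference cover can be illustrated with a small sketch. It assumes the usual definition (a set D of residues modulo n such that every residue is a difference of two elements of D) and uses a simple textbook-style construction with at most $2\lceil \sqrt{n} \rceil$ elements. The helper names are hypothetical, and the construction is only meant to show why covers of size $O(\sqrt{n})$ exist; it is not claimed to be the construction used in Chapter 2.

```python
import math


def sqrt_difference_cover(n):
    """Build a difference cover for Z_n with at most 2*ceil(sqrt(n)) elements.

    Hypothetical helper for illustration only.
    """
    if n < 1:
        raise ValueError("n must be positive")
    r = math.isqrt(n - 1) + 1                        # r = ceil(sqrt(n)) for n >= 1
    small = {b % n for b in range(r)}                # offsets 0, 1, ..., r-1
    large = {(a * r) % n for a in range(1, r + 1)}   # multiples r, 2r, ..., r*r
    # Every k in [1, n) equals a*r - b with a = ceil(k / r) <= r and 0 <= b < r,
    # and k = 0 is the difference 0 - 0, so all residues are covered.
    return small | large


def is_difference_cover(d, n):
    """Check that every residue mod n is a difference of two elements of d."""
    diffs = {(x - y) % n for x in d for y in d}
    return diffs == set(range(n))


if __name__ == "__main__":
    cover = sqrt_difference_cover(100)
    print(sorted(cover), is_difference_cover(cover, 100))
```

Smaller covers exist for particular n. For instance, {0, 1, 3, 9} covers all residues modulo 13 with only four elements, which `is_difference_cover({0, 1, 3, 9}, 13)` confirms.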


1.2.2  Control  mechanisms 

Large  digital  systems  use  control  mechanisms  for  several  functions,  including  establishing  timing 
disciplines,  triggering  events,  and  sequencing  transactions.  The  complexity  of  a  large  digital 
system  generally  calls  for  the  separation  of  the  control  mechanisms  from  the  communication 
and computation structures. Descriptions of control mechanisms for digital systems appear in
[20,  88],  for  bus  systems  in  [16,  22,  40,  57,  82],  and  for  communication  networks  in  [12,  61,  78]. 

Chapter  3  of  this  thesis  explores  the  problem  of  arbitrating  among  modules  in  a  digital 
system. Many arbitration mechanisms have been developed that use daisy chains, centralized
priority circuits, polling mechanisms, token passing schemes, and carrier sense multiple
access protocols, among others (see [12, 16, 22, 40, 45, 46, 57, 61, 78, 82, 88]). We focus on
distributed priority arbitration mechanisms, where contention is resolved using predetermined
module priorities and arbitration processes are carried out in a distributed manner by system
modules. Distributed priority arbitration mechanisms are used in many modern systems,
including numerous multiprocessors and data communication networks. Specifically, we investigate
arbitration mechanisms that employ dedicated arbitration busses and use asynchronous
globally  or  locally  timed  combinational  logic.  Several  other  studies  of  bus-based  arbitration 
mechanisms  appear  in  [3,  22,  23,  24,  47,  71,  79,  80,  81]. 

In  Chapter  3,  we  examine  distributed  asynchronous  priority  arbitration  mechanisms  that 
arbitrate among n modules using m arbitration busses, for $\lg n \le m \le n$. A widely used
distributed asynchronous mechanism is the binary arbitration scheme [79], which with $m = \lg n$
busses arbitrates in $t = \lg n$ units of time. We present a new asynchronous scheme, binomial
arbitration, that by using $m = \lg n + 1$ busses reduces the arbitration time to $t = \frac{1}{2} \lg n$.
Extending this result, we present the generalized binomial arbitration scheme that achieves
a bus-time tradeoff of the form $m = \Theta(t n^{1/t})$ between the number of arbitration busses m
and the arbitration time t (in units of bus-settling delay), for values of $\lg n \le m \le n$ and
$1 \le t \le \lg n$. Our schemes are based on a novel analysis of data-dependent delays. Most
importantly,  our  schemes  can  be  adopted  with  no  changes  to  existing  hardware  and  protocols; 
they  merely  involve  selecting  a  good  set  of  priority  arbitration  codewords.  We  also  investigate 
the  capabilities  of  general  asynchronous  priority  arbitration  schemes  that  employ  busses  and 
present  some  lower  bound  arguments  that  demonstrate  the  efficiency  of  our  schemes. 
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1.2.3  Transmission  lines 

The speed of information transfer through a communication medium is bounded by several
physical properties of the medium. Different media such as radio broadcast channels, electrical
wires, and optical fibers have different propagation speeds, but they can all be modeled
essentially in the same manner. In any communication system, the information sent by a module
requires time to propagate and reach other modules. Communication protocols must, therefore,
account for signal propagation by incorporating appropriate time intervals.

In  Chapters  3  and  4,  we  investigate  how  propagation  delays  of  digital  signals  on  electrical 
busses  can  influence  the  design  of  communication  protocols.  The  propagation  of  a  signal  on 
an  electrical  bus  depends  on  the  length,  shape,  and  other  properties  of  the  bus.  A  high-speed 
bus  is  modeled  as  an  analog  transmission  line  with  associated  impedance  that  determines  the 
propagation  speed  of  signals  along  it  (see  [5,  40]).  Most  computer  systems,  however,  use  the 
digital  abstraction,  which  specifies  certain  discrete  voltage  levels  for  representing  logic  values. 
When  designing  communication  protocols  for  electrical  busses,  signal  propagation  delays  must 
be  accounted  for,  as  done,  for  example,  in  Ethernet  [63].  A  common  method  of  dealing  with 
different  and  unpredictable  propagation  delays  on  a  shared  medium  is  to  allow  sufficient  time 
for  the  propagation  of  signals  from  the  furthest  module  in  the  system  and  for  the  settlement  of 
the  communication  medium.  This  approach  is  explored  in  Chapter  3,  where  the  time  required 
by  bus-based  arbitration  mechanisms  to  stabilize  is  measured  in  units  of  bus-settling  delay. 
One unit of bus-settling delay is an upper bound on the time that an electrical bus needs to resolve
various transient effects and reach a valid logic value. In Chapter 4, on the other hand, we
investigate  a  more  elaborate  model  of  a  bus  as  a  digital  transmission  line,  which  takes  into 
account  propagation  of  signals  along  a  bus  line  but  ignores  the  analog  nature  of  the  signals. 

In Chapter 4, we examine the performance of priority arbitration schemes presented in
Chapter 3 under the digital transmission line bus model. This bus model accounts for the
propagation time of signals along bus lines and assumes that the propagating signals are always
valid digital signals. A widely held misconception is that in the digital transmission line model
the arbitration time of the binary arbitration scheme is at most 4 units of bus-propagation delay.
We formally disprove this conjecture by demonstrating that the arbitration time of the binary
arbitration scheme is heavily dependent on the arrangement of the arbitrating modules in the
system. We provide a general scenario of module arrangement on $m$ busses, for which binary
arbitration takes at least $m/2$ units of bus-propagation delay to stabilize. We also prove that
for general arrangements of modules on $m$ busses, binary arbitration settles in at most $m/2 + 2$
units of bus-propagation delay, while binomial arbitration settles in at most $m/4 + 2$ units of
bus-propagation delay, thereby demonstrating the superiority of binomial arbitration for general
arrangements of modules under the digital transmission line model. For linear arrangements of
modules in increasing order of priorities and equal spacings between modules, we show that 3
units of bus-propagation delay are necessary for binary arbitration to settle, and we sketch an
argument that 3 units of bus-propagation delay are also asymptotically sufficient.


Chapter  2 


Bussed  Permutation  Architectures 


This chapter explores the problem of efficiently permuting data stored in VLSI chips in
accordance with a predetermined set of permutations. By connecting chips with bussed
interconnections, as opposed to point-to-point interconnections, we show that the number of pins per chip
can often be reduced. For example, for infinitely many $n$, we exhibit permutation architectures
with $\lceil \sqrt{n} \rceil$ pins per chip that can realize any of the $n$ cyclic shifts on $n$ chips in one clock
tick. When the set of permutations forms a group with $p$ elements, any permutation in the
group can be realized in one clock tick by an architecture with $O(\sqrt{p \ln p})$ pins per chip. When
the permutation group is abelian, we show that $O(\sqrt{p})$ pins suffice. These results are all
derived from a mathematical characterization of uniform permutation architectures based on the
combinatorial notion of a difference cover. We investigate properties of difference covers and
describe procedures for designing efficient difference covers for many classes of permutation sets.
We also consider uniform permutation architectures that realize permutations in several clock
ticks, instead of one, and show that further savings in the number of pins per chip can be
obtained. Our methodology of using difference covers for designing efficient uniform architectures
is applicable to a wide range of permutation networks, including hypercubes, multidimensional
meshes, and shuffle-exchange networks.


This chapter describes joint research with Joe Kilian and Charles Leiserson [46] and [49].

2.1 Introduction

The  organization  of  communication  among  chips  is  a  major  concern  in  the  design  of  an  electronic 
system.  Because  of  the  costs  associated  with  wiring  and  packaging,  it  is  generally  desirable 
to  minimize  the  number  of  wires  and  the  number  of  pins  per  chip  in  an  architecture.  Much 
research has focused on point-to-point and multistage interconnections (see [6, 7, 27, 37, 75, 86]).
In  this  chapter,  we  investigate  how  busses  can  be  employed  to  efficiently  implement  various 
communication  patterns  among  a  set  of  chips.  Other  studies  of  bussed  interconnection  schemes 
for  realizing  communication  patterns  can  be  found  in  [10,  11,  13,  30,  54,  64,  77]. 

Perhaps the simplest example of the advantage of bussed interconnections is the use of
a single shared bus to communicate between any pair of chips connected to the bus in one
clock tick. Communicating between any pair of chips in one clock tick can be implemented
with two-pin wires, but any such scheme requires $\binom{n}{2}$ wires and $n - 1$ pins per chip, where $n$
is the number of chips in the system.¹ Of course, a two-pin (point-to-point) interconnection
scheme may be able to implement more communication patterns, but if we are only interested
in communication between individual pairs, the additional power, which comes at a high cost,
is wasted.

An example that better illustrates the ideas in this chapter comes from the problem of
building a fast cyclic shifter (sometimes called a barrel shifter) on $n$ chips. Initially, each chip $c$
contains a one-bit value $e_c$. The function of the shifter is to move each bit $e_c$ to chip $c + s \pmod{n}$
in one clock tick, where $s$ can be any value between $0$ and $n - 1$.

Any cyclic shifter that uses only two-pin wires requires at least $\binom{n}{2}$ wires and $n - 1$ pins per
chip in order to shift in one clock tick because each chip must be able to communicate directly
with each of the other $n - 1$ chips. Using busses, however, we can do much better. Figure 2-1
gives  an  architecture  for  a  cyclic  shifter  on  13  chips  which  uses  13  busses  and  only  4  pins  per 
chip.  To  realize  a  shift  by  8,  for  example,  each  chip  writes  its  bit  to  pin  3  and  reads  from  pin  1. 
The  reader  may  verify  that  all  other  cyclic  shifts  among  the  chips  are  possible  in  one  clock  tick. 
(In  Section  2.4,  we  give  a  general  method  for  constructing  such  cyclic  shifters  based  on  finite 
projective  planes.) 

¹ Unless otherwise specified, we count only data pins in our analysis and omit consideration of the pins for
control, clock, power, and ground since they are needed by all implementations.

Figure 2-1: A cyclic shifter on 13 chips that uses 13 busses. Each chip has 4 pins, and each bus has 4
chips connected to it. This cyclic shifter is based on the difference cover $\{0, 1, 3, 9\}$ for $Z_{13}$.

The  cyclic  shifter  of  Figure  2-1  has  the  advantage  of  uniformity.  All  chips  have  exactly  the 
same number of pins, and to accomplish each of the 13 permutations specified by the problem,
all  chips  write  to  (and  read  from)  pins  with  identical  labels.  For  all  busses,  the  number  of  pins 
per  bus  is  4,  which  is  the  same  as  the  number  of  pins  per  chip.  Moreover,  the  connections 
between  chips  and  busses  follow  a  periodic  pattern.  The  uniformity  of  the  architecture  leads  to 
simplicity  in  the  control  of  the  system.  Four  control  wires  from  a  central  controller  are  sufficient 
to  determine  each  of  the  13  shifts — two  wires  for  specifying  the  number  of  the  pin  on  which  to 
write,  and  two  for  the  pin  to  read — which  is  the  minimum  possible.  Thus,  our  control  scheme 
uses  the  minimum  number  of  control  pins,  and  the  on-chip  decoding  logic  is  straightforward 
and  identical  for  all  the  chips. 

Cyclic shifters for general $n$ can be constructed using an idea from combinatorial
mathematics related to difference sets [43, p. 121]. (See also [14, 34, 38, 56, 66].)

Definition 1 A subset $D \subseteq Z_n$ of the integers modulo $n$ is a difference cover for $Z_n$ if for all
$s \in Z_n$, there exist $d_i, d_j \in D$ such that $s = d_i - d_j \pmod{n}$.


That is, every integer in $Z_n$ can be represented as the difference modulo $n$ of two integers in
$D$. For example, the set $D = \{0, 1, 3, 9\}$ is a difference cover for $Z_{13}$, since


 0 = 0 - 0
 1 = 1 - 0
 2 = 3 - 1
 3 = 3 - 0
 4 = 0 - 9
 5 = 1 - 9
 6 = 9 - 3
 7 = 3 - 9
 8 = 9 - 1
 9 = 9 - 0
10 = 0 - 3
11 = 1 - 3
12 = 0 - 1,

where all subtractions are performed modulo 13.
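
The table of differences can also be checked mechanically. The following is a minimal sketch in Python, not part of the original development; the function names `is_difference_cover` and `difference_table` are hypothetical and chosen only for illustration. It tests Definition 1 by brute force over all pairs of elements of $D$.

```python
from itertools import product


def is_difference_cover(D, n):
    """Return True if every s in Z_n equals d_i - d_j (mod n) for some d_i, d_j in D."""
    differences = {(di - dj) % n for di, dj in product(D, repeat=2)}
    return differences == set(range(n))


def difference_table(D, n):
    """Map each reachable residue s to one witness pair (d_i, d_j) with s = d_i - d_j (mod n)."""
    table = {}
    for di, dj in product(D, repeat=2):
        # Keep the first witness found for each residue.
        table.setdefault((di - dj) % n, (di, dj))
    return table


if __name__ == "__main__":
    D, n = [0, 1, 3, 9], 13
    print(is_difference_cover(D, n))
    for s, (di, dj) in sorted(difference_table(D, n).items()):
        print(f"{s:2d} = {di} - {dj}")
```

The brute-force test takes time quadratic in $|D|$, which is adequate for checking small covers by hand; it says nothing about how to find small covers.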

Given a difference cover for $Z_n$ with $k$ elements, a cyclic shifter on $n$ chips with $n$ busses and
$k$ pins per chip can be constructed. Suppose $D = \{d_0, d_1, \ldots, d_{k-1}\}$ is a difference cover for $Z_n$.
In the cyclic shifter, chip $c$ connects via its pin $i$ to bus $c + d_i \pmod{n}$, for all $c = 0, 1, \ldots, n - 1$
and $i = 0, 1, \ldots, k - 1$. To see that any cyclic shift on the $n$ chips can be uniformly realized,
consider a cyclic shift by $s$. Since $D$ is a difference cover for $Z_n$, there exist $d_i, d_j \in D$ such that
$s = d_i - d_j \pmod{n}$. To realize the shift by $s$, each chip writes to pin $i$ and reads from pin $j$.
Chip $c$ therefore writes onto bus $c + d_i$, and bus $c + d_i$ is read by chip $(c + d_i) - d_j = c + s$. No
collisions occur because each bus has exactly one pin labeled $i$ and one pin labeled $j$ connected
to it, as can be verified.
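
The construction can be simulated directly. The sketch below is an illustration under stated assumptions rather than a description of any hardware: the names `pins_for_shift` and `cyclic_shift` are hypothetical, pins are numbered from 0 in the order the elements of $D$ are listed, and one call to `cyclic_shift` models one clock tick.

```python
def pins_for_shift(s, D, n):
    """Return pin numbers (i, j) with s = D[i] - D[j] (mod n)."""
    for i, di in enumerate(D):
        for j, dj in enumerate(D):
            if (di - dj) % n == s % n:
                return i, j
    raise ValueError(f"D does not cover the shift {s}")


def cyclic_shift(bits, s, D):
    """Simulate one clock tick of the bussed shifter on len(bits) chips."""
    n = len(bits)
    i, j = pins_for_shift(s, D, n)
    busses = [None] * n
    written = [False] * n
    for c in range(n):
        b = (c + D[i]) % n  # pin i of chip c is wired to bus c + d_i
        if written[b]:
            raise RuntimeError("two chips wrote to the same bus")
        busses[b] = bits[c]
        written[b] = True
    # Pin j of chip c is wired to bus c + d_j, so chip c reads that bus.
    return [busses[(c + D[j]) % n] for c in range(n)]


if __name__ == "__main__":
    D = [0, 1, 3, 9]
    bits = list(range(13))  # distinct values make the movement visible
    print(pins_for_shift(8, D, 13))
    print(cyclic_shift(bits, 8, D))
```

With $D = \{0, 1, 3, 9\}$ listed in that order, `pins_for_shift(8, D, 13)` selects the pair of pins used in the shift-by-8 example for Figure 2-1.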

The remainder of this chapter explores permutation architectures, the properties of
multiple-pin interconnections, and related combinatorial mathematics. In Section 2.2, we define a
permutation architecture, introduce the notion of uniformity, and prove some basic properties of
architectures that employ busses to realize arbitrary sets of permutations. Section 2.3 defines
the notion of a difference cover for a set of permutations, relates it to the notion of a uniform
permutation architecture, and proves some properties of difference covers. In Section 2.4, we
show how to build cyclic shifters that are provably efficient. Section 2.5 investigates how to
design small difference covers for any set of permutations that forms a finite group. In
Section 2.6, we extend the discussion to uniform architectures that realize permutations in more
than one clock tick. Several applications and extensions of bussed permutation architectures
are discussed in Section 2.7, as well as further research and some questions left open by our
research.

2.2  Permutation  architectures 

In  this  section  we  formally  define  the  notion  of  a  permutation  architecture,  and  we  make  precise 
the  notion  of  uniformity.  We  also  prove  some  basic  properties  of  permutation  architectures  that 
realize  arbitrary  sets  of  permutations.  The  definitions  in  this  section  are  somewhat  intricate 
and  tedious,  and  are  indicative  of  the  difficulties  faced  in  the  design  of  efficient  permutation 
architectures.  In  the  next  section,  however,  we  use  these  definitions  to  show  that  reasoning 
about  uniform  permutation  architectures  is  essentially  equivalent  to  reasoning  about  difference 
covers,  a  simpler  and  more  elegant  mathematical  notion.  The  remainder  of  this  chapter  then 
uses  the  simpler  notion. 

For convenience, we adopt a few notational conventions. We use multiplicative notation
to denote composition of permutations. The inverse of a permutation $\pi$ is denoted by $\pi^{-1}$.
Composition of functions is performed in right-to-left order, so that $\pi_1 \pi_2$ is defined by
$\pi_1 \pi_2 (z) = \pi_1(\pi_2(z))$. The identity permutation on $n$ elements is denoted by $I_n$, or by $I$ if the number
of elements is unimportant. For a permutation set $\Phi$, we denote by $\Phi^{-1}$ the set of all the
inverses of the permutations of $\Phi$, i.e., $\Phi^{-1} = \{\phi^{-1} : \phi \in \Phi\}$. For two permutation sets $\Phi$ and
$\Psi$, the notation $\Phi\Psi$ is used to denote the permutation set $\{\phi\psi : \phi \in \Phi \text{ and } \psi \in \Psi\}$. We use the
notation $[n]$ to denote the set of $n$ integers $\{0, 1, \ldots, n - 1\}$.

2.2.1  What  is  a  permutation  architecture? 

We begin by formally defining the notion of a permutation architecture.

Definition 2 A permutation architecture is a 6-tuple $A = (C, B, P, \mathrm{CHIP}, \mathrm{BUS}, \mathrm{LABEL})$ as
follows.

1. $C$ is a set of chips;

2. $B$ is a set of busses;

3. $P$ is a set of pins;

4. CHIP is a function $\mathrm{CHIP} : P \to C$;

5. BUS is a function $\mathrm{BUS} : P \to B$;

6. LABEL is a function $\mathrm{LABEL} : P \to N$, where if $x, y \in P$, $x \ne y$, and $\mathrm{CHIP}(x) = \mathrm{CHIP}(y)$,
then $\mathrm{LABEL}(x) \ne \mathrm{LABEL}(y)$.

The  set  C  contains  all  the  chips  in  the  architecture,  and  the  set  B  contains  all  the  busses. 
Which  chips  are  connected  to  which  busses  is  determined  by  the  pins  they  have  in  common; 
the  set  P  contains  all  the  pins.  The  function  CHIP  determines  which  pins  belong  to  which 
chips.  Similarly,  the  function  bus  determines  which  pins  are  interconnected  by  which  bus.  The 
function  LABEL  names  the  pins  on  the  chips  by  natural  numbers  such  that  all  pins  on  a  given 
chip  have  distinct  labels,  which  we  shall  sometimes  call  pin  numbers. 
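
As an illustration only, the 6-tuple can be represented by a list of pin records, since the sets $C$ and $B$ and the three functions are all recoverable from the pins (assuming every chip and every bus carries at least one pin). The names `Pin`, `labels_are_valid`, and `chips_on_bus` below are hypothetical and do not appear in the definition.

```python
from collections import Counter
from typing import NamedTuple


class Pin(NamedTuple):
    chip: int   # CHIP(x)
    bus: int    # BUS(x)
    label: int  # LABEL(x)


def labels_are_valid(pins):
    """True if no chip carries two pins with the same label (item 6 of Definition 2)."""
    counts = Counter((p.chip, p.label) for p in pins)
    return all(v == 1 for v in counts.values())


def chips_on_bus(pins, bus):
    """Chips connected to the given bus, i.e. chips owning a pin x with BUS(x) = bus."""
    return sorted({p.chip for p in pins if p.bus == bus})


if __name__ == "__main__":
    # Pins of the 13-chip cyclic shifter built from the difference cover {0, 1, 3, 9}.
    pins = [Pin(c, (c + d) % 13, i) for c in range(13) for i, d in enumerate([0, 1, 3, 9])]
    print(labels_are_valid(pins))
    print(chips_on_bus(pins, 0))
```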

Our formal definition of a permutation architecture omits several subsystems that
technically should be included, but whose inclusion is not germane to our study. These subsystems
include a control network that specifies what permutation is to be performed and clocking
circuitry for synchronization. Our focus is on the structure of the bussed interconnections for
permuting the data, and thus our definition encompasses only this aspect of the architecture.
We now define what it means for a permutation architecture to realize a permutation.

Definition 3 A permutation architecture $A = (C, B, P, \mathrm{CHIP}, \mathrm{BUS}, \mathrm{LABEL})$ realizes a
permutation $\pi : C \to C$ if there exist two functions $\mathrm{WRITE}_\pi : C \to P$ and $\mathrm{READ}_\pi : C \to P$, such that
for any chips $c, c_1, c_2 \in C$, we have:

1. $\mathrm{CHIP}(\mathrm{READ}_\pi(c)) = \mathrm{CHIP}(\mathrm{WRITE}_\pi(c)) = c$;

2. $\mathrm{BUS}(\mathrm{WRITE}_\pi(c)) = \mathrm{BUS}(\mathrm{READ}_\pi(\pi(c)))$;

3. $c_1 \ne c_2$ implies $\mathrm{BUS}(\mathrm{WRITE}_\pi(c_1)) \ne \mathrm{BUS}(\mathrm{WRITE}_\pi(c_2))$.

The architecture uniformly realizes $\pi$ if, in addition:

4. $\mathrm{LABEL}(\mathrm{WRITE}_\pi(c_1)) = \mathrm{LABEL}(\mathrm{WRITE}_\pi(c_2))$;

5. $\mathrm{LABEL}(\mathrm{READ}_\pi(c_1)) = \mathrm{LABEL}(\mathrm{READ}_\pi(c_2))$.

We say that a permutation architecture realizes a set $\Pi$ of permutations if it realizes every
permutation in $\Pi$. We say that it uniformly realizes $\Pi$ if it uniformly realizes every permutation
in $\Pi$.

Intuitively, for a permutation $\pi$, the functions $\mathrm{WRITE}_\pi$ and $\mathrm{READ}_\pi$ identify the write pin
and the read pin for each chip. Condition 1 makes sure that each chip writes and reads pins
that are connected to it. Condition 2 ensures that the bus to which chip $c$ writes is read by
chip $\pi(c)$. Condition 3 guarantees that no collisions occur, that is, no two data transfers use
the same bus. The architecture uniformly realizes a permutation (Conditions 4 and 5) if all
chips write to pins with the same pin number and read from pins with the same pin number,
as in the cyclic shifter from Figure 2-1.
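
The five conditions translate into a short checker. This is a sketch under assumptions, not a definitive implementation: permutations and the functions $\mathrm{WRITE}_\pi$ and $\mathrm{READ}_\pi$ are modeled as Python dictionaries keyed by chip, the names `Pin`, `realizes`, and `uniformly_realizes` are hypothetical, and the checker assumes the supplied pins really belong to the architecture's pin set $P$.

```python
from typing import NamedTuple


class Pin(NamedTuple):
    chip: int   # CHIP(x)
    bus: int    # BUS(x)
    label: int  # LABEL(x)


def realizes(pi, write, read):
    """Conditions 1-3 of Definition 3; pi, write, and read are dicts keyed by chip."""
    chips = list(pi)
    cond1 = all(write[c].chip == c and read[c].chip == c for c in chips)
    cond2 = all(write[c].bus == read[pi[c]].bus for c in chips)
    cond3 = len({write[c].bus for c in chips}) == len(chips)
    return cond1 and cond2 and cond3


def uniformly_realizes(pi, write, read):
    """Conditions 1-5 of Definition 3."""
    cond4 = len({write[c].label for c in pi}) <= 1
    cond5 = len({read[c].label for c in pi}) <= 1
    return realizes(pi, write, read) and cond4 and cond5


if __name__ == "__main__":
    # Shift by 8 on the 13-chip shifter: write on pin 3 (d = 9), read on pin 1 (d = 1).
    n = 13
    pi = {c: (c + 8) % n for c in range(n)}
    write = {c: Pin(c, (c + 9) % n, 3) for c in range(n)}
    read = {c: Pin(c, (c + 1) % n, 1) for c in range(n)}
    print(uniformly_realizes(pi, write, read))
```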

Our  definition  of  a  permutation  architecture  implies  that  “complete”  permutations  are  to  be 
realized,  that  is,  every  chip  sends  exactly  one  datum  and  receives  exactly  one  datum.  Moreover, 
an  interconnection  is  required  even  when  a  chip  sends  a  datum  to  itself.  Since  no  collisions  occur, 
the  number  of  busses  in  the  architecture  must  be  at  least  the  number  of  chips.  This  observation 
leads  directly  to  the  following  theorem. 

Theorem 1 In any permutation architecture that realizes some nonempty permutation set $\Pi$,
the average number of pins per bus is at most the average number of pins per chip.

Proof. Let $A = (C, B, P, \mathrm{CHIP}, \mathrm{BUS}, \mathrm{LABEL})$ be a permutation architecture for $\Pi$. The average
number of pins per chip is $|P| / |C|$, and the average number of pins per bus is $|P| / |B|$.
Condition 3 of Definition 3 says that for any permutation $\pi \in \Pi$, any two distinct chips are mapped
to distinct busses. Consequently, we get that $|B| \ge |C|$, which proves the theorem. ■

Under the assumption that no interconnection is needed for a chip to send data to itself,
Theorem 1 is no longer applicable. A similar theorem can be proved for this model, however,
which involves the number of fixed points in the permutations realized by the architecture.
Specifically, suppose the architecture realizes a set $\Pi$ of permutations. Define the rank of a
permutation $\pi \in \Pi$ as $\mathrm{RANK}(\pi) = |\{c \in C : \pi(c) \ne c\}|$, and define the rank of the permutation
set $\Pi$ as $\mathrm{RANK}(\Pi) = \max_{\pi \in \Pi} \mathrm{RANK}(\pi)$. The analogue to Theorem 1 states that the ratio
between the average number of pins per bus and the average number of pins per chip is at most
$|C| / \mathrm{RANK}(\Pi)$.

2.2.2  Uniform  permutation  architectures 

In any architecture $A$ that uniformly realizes a permutation set $\Pi$, the number of pins that are
actually used to uniformly realize $\Pi$ is the same for all chips, and additional pins on a chip
are unused. Furthermore, the number of busses used in realizing any permutation $\pi \in \Pi$ is
equal to the number of chips. These observations lead to the following definition of a uniform
architecture.

Definition 4 A uniform permutation architecture for a permutation set $\Pi$ is a permutation
architecture $A = (C, B, P, \mathrm{CHIP}, \mathrm{BUS}, \mathrm{LABEL})$ such that:

1. $A$ uniformly realizes $\Pi$;

2. $|\{x \in P : \mathrm{CHIP}(x) = c_i\}| = |\{x \in P : \mathrm{CHIP}(x) = c_j\}|$ for any two chips $c_i, c_j \in C$;

3. $|B| = |C|$;

4. if $x \ne y$ and $\mathrm{LABEL}(x) = \mathrm{LABEL}(y)$, then $\mathrm{BUS}(x) \ne \mathrm{BUS}(y)$.

Thus, all the chips in a uniform permutation architecture have the same number of pins
(Condition 2), the number of busses is equal to the number of chips (Condition 3), and the labels of
the pins on any bus are distinct (Condition 4).
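
Conditions 2 through 4 are purely structural and can be checked from the pin list alone. The sketch below is illustrative: the function name `is_structurally_uniform` is hypothetical, each pin is modeled as a `(chip, bus, label)` triple, and the chip and bus sets are inferred from the pins, so the sketch assumes that every chip and every bus carries at least one pin. Condition 1 depends on the permutation set and is not checked here.

```python
from collections import Counter


def is_structurally_uniform(pins):
    """Check Conditions 2-4 of Definition 4 for an iterable of (chip, bus, label) triples."""
    pins = list(pins)
    chips = {c for c, _, _ in pins}
    busses = {b for _, b, _ in pins}
    pins_per_chip = Counter(c for c, _, _ in pins)
    cond2 = len(set(pins_per_chip.values())) <= 1        # same pin count on every chip
    cond3 = len(busses) == len(chips)                    # |B| = |C|
    labels_per_bus = Counter((b, label) for _, b, label in pins)
    cond4 = all(v == 1 for v in labels_per_bus.values())  # labels on a bus are distinct
    return cond2 and cond3 and cond4


if __name__ == "__main__":
    shifter = [(c, (c + d) % 13, i) for c in range(13) for i, d in enumerate([0, 1, 3, 9])]
    print(is_structurally_uniform(shifter))
```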

The following theorem demonstrates that any permutation architecture that uniformly
realizes some permutation set $\Pi$ can be made into a uniform architecture for $\Pi$.

Theorem 2 Let $A = (C, B, P, \mathrm{CHIP}, \mathrm{BUS}, \mathrm{LABEL})$ be a permutation architecture that uniformly
realizes the permutation set $\Pi$, and let $k$ be the smallest number of pins on any chip in $C$. Then
there is a uniform architecture $A' = (C', B', P', \mathrm{CHIP}', \mathrm{BUS}', \mathrm{LABEL}')$ for $\Pi$ with at most $k$ pins
per chip.

Proof. We construct the uniform architecture $A'$ from the permutation architecture
$A$ in two steps. First, we construct an intermediate permutation architecture
$A'' = (C'', B'', P'', \mathrm{CHIP}'', \mathrm{BUS}'', \mathrm{LABEL}'')$ by removing extraneous pins from chips in $A$ such
that all chips end up with the same number of pins per chip and such that each pin plays a role
in uniformly realizing $\Pi$. Then, the busses of $A''$ are reorganized to produce the architecture
$A'$ in such a way that the number of busses in $A'$ is equal to the number of chips. We assume
that the permutation set $\Pi$ is nonempty, since otherwise the theorem is trivial.

In the first step, we remove pins that are unused in uniformly realizing $\Pi$. Since $A$ uniformly
realizes $\Pi$, each permutation $\pi \in \Pi$ can be associated with a distinct pair $(i, j)$ of pin labels
corresponding to the labels that all chips write to and read from in order to realize $\pi$. A pin is
unused if its label does not appear in any of these $|\Pi|$ pairs. Removing the unused pins results
in the architecture $A''$ in which all chips have the same number of pins, since each chip has
exactly one pin for each label used in uniformly realizing $\Pi$. The permutation architecture $A''$
uniformly realizes $\Pi$, and furthermore, each pin is used in uniformly realizing some $\pi \in \Pi$. If
we let $s$ denote the number of pins per chip in $A''$, then we have $s \le k$, since originally at least
one chip had $k$ pins and no pins were added.

In the second step, we reorganize the busses of $A''$ to produce the uniform architecture $A'$ in
which the number of busses is equal to the number of chips. For any permutation architecture
that realizes a nonempty permutation set, the number of busses is never smaller than the number
of chips. Assume without loss of generality that $C'' = [n]$, $B'' = [m]$, and $\mathrm{range}(\mathrm{LABEL}'') = [s]$.
The theorem is proved if the architecture $A''$ uses only $n = |C''|$ busses, but in general, the
architecture might use $m > n$ busses.

We define a collection of mappings $\Psi = \{\psi_0, \psi_1, \ldots, \psi_{s-1}\}$, where for each $0 \le i \le s - 1$,
the mapping $\psi_i : [n] \to [m]$ is defined to be $\psi_i(c) = b$ if and only if chip $c \in C''$ is connected
via its pin number $i$ to bus $b \in B''$. The elements of $\Psi$ are indeed mappings since each chip
has a pin numbered $i$ for each $0 \le i \le s - 1$. The mappings are injective (one-to-one), since
otherwise two pins with the same pin number would be connected to the same bus, and both
pins could not be used to uniformly realize permutations, thereby violating the construction
of $A''$ in the first step. The collection $\Psi$ is a multiset, since it may be that two different pin
numbers $i \ne j$ define the same mapping (i.e., $\psi_i = \psi_j$). The key idea is that any permutation
is implemented by each chip writing to pin $i$ and reading from pin $j$, thereby employing the
mapping $\psi_i$ to write data from the $n$ chips to $n$ distinct busses, and the inverse of the mapping
$\psi_j$ to read data from the same $n$ busses back to the $n$ chips.

We now show how to reorganize the busses of $A''$ in order to construct a uniform architecture
$A'$. We partition $\Psi$ into $l$ equivalence classes $\Psi_0 \cup \Psi_1 \cup \cdots \cup \Psi_{l-1}$ such that $\psi_i$ and $\psi_j$ are
in the same equivalence class $\Psi_r$ if and only if $\mathrm{range}(\psi_i) = \mathrm{range}(\psi_j)$. This partitioning has
the property that if $\pi \in \Pi$, then there exists an $r$ such that $\pi = \psi_j^{-1} \psi_i$ where $\psi_i, \psi_j \in \Psi_r$.
(Recall that the inverse of an injective mapping $\psi : [n] \to [m]$ is defined as the mapping
$\psi^{-1} : \mathrm{range}(\psi) \to [n]$ such that if $\psi(c) = b$, then $\psi^{-1}(b) = c$.) For each $0 \le r \le l - 1$, pick a
bijection (one-to-one, onto) $f_r : \mathrm{range}(\psi) \to [n]$, where $\psi$ is any mapping in $\Psi_r$. (We can pick
a bijection, since $\psi$ is injective, which implies $|\mathrm{range}(\psi)| = n$.) We define the architecture $A'$
by $C' = C''$, $B' = [n]$, $P' = P''$, $\mathrm{CHIP}' = \mathrm{CHIP}''$, $\mathrm{LABEL}' = \mathrm{LABEL}''$, and for any pin $x \in P'$ such
that $\psi_{\mathrm{LABEL}'(x)} \in \Psi_r$, we define $\mathrm{BUS}'(x) = f_r(\mathrm{BUS}''(x))$.

The architecture $A'$ has exactly $s$ pins per chip and satisfies $|B'| = |C'| = n$, thereby
satisfying Conditions 2 and 3 of Definition 4. We show Condition 4 holds by considering any
two pins $x$ and $y$ with $\mathrm{LABEL}'(x) = \mathrm{LABEL}'(y) = i$. We have $\mathrm{BUS}'(x) = f_r(\mathrm{BUS}''(x))$ and
$\mathrm{BUS}'(y) = f_r(\mathrm{BUS}''(y))$ for some $f_r$ as defined in the previous paragraph. Since $f_r$ is an injective
mapping and because Condition 4 of Definition 4 holds for $A''$, we then have $x \ne y$ implies
$\mathrm{BUS}'(x) \ne \mathrm{BUS}'(y)$.

It remains to show that Condition 1 of Definition 4 holds, that is, that $A'$ uniformly realizes
$\Pi$. Consider any permutation $\pi \in \Pi$. Since $A''$ uniformly realizes $\Pi$, there exists a pair of pin
labels $(i, j)$ such that $\pi$ is realized in $A''$ by each chip writing to its pin numbered $i$ and reading
from its pin numbered $j$. We use the same pin labels $(i, j)$ to realize the permutation $\pi$ in $A'$.
Conditions 1, 4, and 5 of Definition 3 are immediately satisfied. To verify Conditions 2 and 3
we use the following observation. In architecture $A''$ chip $c$ is connected via its pin labeled $h$ to
bus $\psi_h(c)$, while in architecture $A'$ it is connected to bus $f_r(\psi_h(c))$, where $\psi_h \in \Psi_r$. Condition
2 now holds since $\pi = \psi_j^{-1} \psi_i = (f_r \psi_j)^{-1} (f_r \psi_i)$. Condition 3 holds since $f_r \psi_i$ is a permutation
on $[n]$. We therefore conclude that $A'$ is a uniform architecture for $\Pi$ with at most $k$ pins per
chip.

2.2.3 Some properties of uniform architectures

From the definition of uniform permutation architectures one can derive several structural
properties of these architectures. The next theorem provides a lower bound on the number of
pins per chip in any uniform architecture for a permutation set $\Pi$.

Theorem 3 Let $A = (C, B, P, \mathrm{CHIP}, \mathrm{BUS}, \mathrm{LABEL})$ be a uniform permutation architecture for a
permutation set $\Pi$. Then the number of pins per chip in $A$ is at least $\sqrt{|\Pi|}$.

Proof. Because architecture $A$ realizes $\Pi$ uniformly, we can associate each $\pi \in \Pi$ with a pair
$(i, j)$ of pin numbers such that $\pi$ is realized by each chip writing to its pin labeled $i$ and reading
from its pin labeled $j$. Since $A$ is uniform, each chip has exactly $|P| / |C|$ pins, and the number
of such pairs is $(|P| / |C|)^2$. No two permutations can be associated with the same pair, and
thus, we have $(|P| / |C|)^2 \ge |\Pi|$ or $|P| / |C| \ge \sqrt{|\Pi|}$. ■
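
Since the number of pins per chip is an integer, the bound of Theorem 3 can be stated as the smallest integer $k$ with $k^2 \ge |\Pi|$. The following small sketch computes that value; the function name `pin_lower_bound` is hypothetical. For the 13 cyclic shifts on 13 chips it gives 4, which is the pin count of the shifter in Figure 2-1.

```python
import math


def pin_lower_bound(num_permutations):
    """Smallest integer k with k * k >= num_permutations (the bound of Theorem 3)."""
    root = math.isqrt(num_permutations)
    return root if root * root == num_permutations else root + 1


if __name__ == "__main__":
    for size in (13, 16, 17, 1000):
        print(size, pin_lower_bound(size))
```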

Another observation made by Fiduccia [28] involves the maximal number of chips reachable
in one clock tick from any given chip in a uniform architecture. (See also [48, p. 308].)

Theorem 4 Any uniform permutation architecture with $k$ pins per chip has exactly $k$ pins per
bus, and each chip is connected to at most $k(k - 1)$ other chips.

Proof. If there is a bus with more than $k$ pins, then two pins on the bus must have the
same label, contradicting Condition 4 of Definition 4. Now, since for uniform architectures the
number of busses is equal to the number of chips, each bus must have exactly $k$ pins. Moreover,
since any chip is connected to at most $k$ different busses (via its $k$ pins), each of which is
connected to no more than $k - 1$ other chips, the number of neighbors of a chip is at most
$k(k - 1)$. ■

A permutation architecture can often nonuniformly realize many more permutations than
the square of the number of pins per chip. As an example, consider a “crossbar” architecture
of $n$ chips and $n$ busses where each chip is connected to each bus. This architecture can
nonuniformly realize all $n!$ permutations, which is much greater than $n^2$, the square of the
number of pins per chip. In Section 2.7.3 we discuss some of the capabilities of nonuniform
permutation architectures.

2.3 Difference covers

In this section, we present our main theorems which establish the relationship between
difference covers for permutation sets and uniform permutation architectures. We also prove some
theorems  concerning  the  design  of  general  difference  covers  and  difference  covers  for  Cartesian 
products  of  permutation  sets.  Finally,  we  present  an  alternative  representation  for  difference 
covers  called  substring  covers  based  on  similar  notions  in  the  literature  of  difference  sets. 

2.3.1  Difference  covers  and  uniform  architectures 

We  first  provide  a  generalization  of  Definition  1  to  arbitrary  sets  of  permutations. 

Definition 5 A difference cover for a permutation set Π is a set Φ = {φ_0, φ_1, ..., φ_{k-1}} of
permutations such that for each π ∈ Π there exist φ_i, φ_j ∈ Φ such that π = φ_j^{-1}φ_i.

Equivalently, we can use our product-of-sets notation to say that Φ is a difference cover for Π
if Φ^{-1}Φ ⊇ Π.

The following theorems show how difference covers and uniform architectures are related.
Theorem 5 describes how to design a uniform architecture for a permutation set Π when a
difference cover for Π is given. Theorem 6 presents a construction of a difference cover for a
permutation set Π from a uniform architecture for Π.

Theorem 5 Let Π be a permutation set, and let Φ be a difference cover for Π such that |Φ| = k.
Then there exists a uniform architecture for Π with k pins per chip.

Proof. Let Φ = {φ_0, φ_1, ..., φ_{k-1}} and assume that Π is a set of permutations on n objects.
We construct a permutation architecture for Π with n busses and k pins per chip. We name
the chips and busses of the architecture by natural numbers, and the pins by pairs of natural
numbers. The architecture A = (C, B, P, CHIP, BUS, LABEL) is defined as C = [n], B = [n],
P = [n] × [k], CHIP(c, i) = c, LABEL(c, i) = i, and BUS(c, i) = φ_{LABEL(c,i)}(CHIP(c, i)) = φ_i(c).
That is, chip c is connected via its pin number i to bus φ_i(c).

To see formally that this architecture uniformly realizes Π, let π ∈ Π be a permutation, and
let φ_i, φ_j ∈ Φ be elements of the difference cover for Π such that π = φ_j^{-1}φ_i. Define the write
function for π as WRITE_π(c) = (c, i) and define the read function for π as READ_π(c) = (c, j).



(Note that i and j are always in the range 0 through k - 1.) We now verify that the five
Conditions of Definition 3 are satisfied. Condition 1 holds since for any chip c ∈ C we have
CHIP(WRITE_π(c)) = CHIP(c, i) = c, and CHIP(READ_π(c)) = CHIP(c, j) = c. Condition 2 is
satisfied since for any chip c ∈ C we have

    BUS(WRITE_π(c)) = BUS(c, i)
                    = φ_i(c)
                    = φ_j φ_j^{-1} φ_i(c)
                    = φ_j(π(c))
                    = BUS(π(c), j)
                    = BUS(READ_π(π(c))).

Condition 3 holds because if BUS(WRITE_π(c_1)) = BUS(WRITE_π(c_2)) for any two chips c_1, c_2 ∈ C,
then we have φ_i(c_1) = φ_i(c_2), which implies that c_1 = c_2, since φ_i is invertible. Conditions 4
and 5 both hold since LABEL(WRITE_π(c)) = i and LABEL(READ_π(c)) = j for all chips c ∈ C. We
therefore conclude that the architecture A uniformly realizes Π. The architecture is uniform,
but Theorem 2 obviates the need to show this fact. ∎
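
The construction in the proof of Theorem 5 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: permutations are stored as tuples, the function names `build_architecture` and `realizing_pins` are hypothetical, and the example cover (the shifts by 0, 1, and 3 on 7 chips) is chosen for the sketch rather than taken from the text. The check performed is Condition 2: the bus written by chip c through pin i must be the bus read by chip π(c) through pin j.

```python
def build_architecture(cover, n):
    """Return bus[(c, i)], the bus wired to pin i of chip c, namely phi_i(c)."""
    return {(c, i): phi[c] for c in range(n) for i, phi in enumerate(cover)}


def realizing_pins(bus, pi, n, k):
    """Find pin labels (i, j) such that every chip c writing on pin i and
    every chip reading on pin j moves the datum of c to chip pi[c].
    Returns None if no pair of labels works."""
    for i in range(k):
        for j in range(k):
            if all(bus[(c, i)] == bus[(pi[c], j)] for c in range(n)):
                return i, j
    return None


# Illustrative example: cyclic shifts on n = 7 chips.
n = 7
offsets = [0, 1, 3]
cover = [tuple((c + d) % n for c in range(n)) for d in offsets]
bus = build_architecture(cover, n)
shifts = [tuple((c + s) % n for c in range(n)) for s in range(n)]
pins = [realizing_pins(bus, pi, n, len(cover)) for pi in shifts]
```

Each entry of `pins` is a pair of pin labels shared by all chips, which is what uniformity requires; three pins per chip suffice here for all seven shifts.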

Given  a  difference  cover  of  small  cardinality,  Theorem  5  says  we  can  construct  a  uniform 
architecture  with  few  pins  per  chip.  In  fact,  the  reverse  is  true  as  well,  as  the  following  theorem 
shows. 


Theorem 6 Let Π be a permutation set, and let A be a uniform architecture for Π with k pins
per chip. Then Π has a difference cover Φ such that |Φ| ≤ k.

Proof. Given a uniform architecture A = (C, B, P, CHIP, BUS, LABEL) for the permutation set
Π, where k is the number of pins on each chip, we construct a difference cover Φ for Π as
follows. Assume without loss of generality that C = B = [n] and range(LABEL) = [k]. For
each pin number i, where i = 0, 1, ..., k - 1, we define φ_i by φ_i(c) = b if and only if chip c is
connected via its pin number i to bus b. We now define the difference cover Φ to be the set
Φ = {φ_0, φ_1, ..., φ_{k-1}}. (The set Φ may have fewer than k elements, since some permutations
may be repeated among the φ_i's.)

To see that Φ is a difference cover for Π, consider any permutation π ∈ Π. Since A
uniformly realizes π, there exists a pair of pin labels (i, j) such that π is realized by each
chip writing to its pin numbered i and reading from its pin numbered j. The labels i and j
satisfy i = LABEL(WRITE_π(c)) and j = LABEL(READ_π(c)) for all chips c ∈ C, as follows from
Conditions 4 and 5 of Definition 3. Conditions 1 and 3 of Definition 3 imply that φ_i and φ_j are
both permutations, and therefore there are φ_h, φ_l ∈ Φ such that φ_h = φ_i and φ_l = φ_j. Finally,
Condition 2 of Definition 3 implies that π = φ_j^{-1}φ_i = φ_l^{-1}φ_h, which proves that Φ is indeed a
difference cover for Π. ∎

2.3.2  Designing  difference  covers 

Theorems  5  and  6  show  that  uniform  architectures  and  difference  covers  are  very  closely  related. 
Thus,  when  designing  a  uniform  permutation  architecture  for  a  set  of  permutations,  it  suffices 
to  focus  on  the  problem  of  constructing  a  good  difference  cover  for  that  set. 

We first present a simple theorem that demonstrates that any arbitrary permutation set Π
has a difference cover of size at most |Π| + 1.

Theorem 7 Let Π be an arbitrary permutation set on n elements. Then Π has a difference
cover of size at most |Π| + 1.

Proof. Define Φ = Π ∪ {I_n}. For any π ∈ Π, we have π = I_n^{-1}π, where π, I_n ∈ Φ. Therefore,
Φ is a difference cover for Π, and |Φ| ≤ |Π| + 1. ∎

Theorem 7 presents a naive construction of a difference cover for an arbitrary permutation
set Π. In general, the bound of Theorem 7 cannot be improved without specific knowledge about
the structure of the permutation set involved. In [30], Fiduccia describes how to construct a
permutation set Π of arbitrary size, for which no difference cover of cardinality |Π| exists. This
shows that the construction of Theorem 7 is optimal for general permutation sets.

Specific knowledge about the structure of a permutation set can indeed be helpful in
obtaining a small difference cover for it. In Sections 2.4 and 2.5, we investigate the construction of
difference covers for cyclic groups of permutations and for groups in general. Here, we examine
permutation sets formed by Cartesian products.

Definition 6 Let Π_1 be a set of permutations from X_1 to X_1, and let Π_2 be a set of
permutations from X_2 to X_2. The Cartesian product Π = Π_1 × Π_2 is the set of permutations from
X_1 × X_2 to X_1 × X_2 defined as Π = {(π_1, π_2) : π_1 ∈ Π_1, π_2 ∈ Π_2}. Operations on the elements
of Π are performed componentwise.

The Cartesian product Π_1 × Π_2 is isomorphic to the Cartesian product Π_2 × Π_1. The
Cartesian product Π = Π_1 × Π_2 is an abelian permutation set if and only if both Π_1 and Π_2
are abelian permutation sets.

The  next  two  lemmas  provide  bounds  on  the  size  of  difference  covers  for  Cartesian  products 
of  permutation  sets. 

Lemma 8 Let Π_1 be a permutation set on n_1 objects, and let Π_2 be a permutation set on n_2
objects. Then the Cartesian product Π = Π_1 × Π_2, which is a permutation set on n_1 · n_2 objects,
has a difference cover of size |Π_1| + |Π_2|.

Proof. Let Φ be the union of {(π_1^{-1}, I_{n_2}) : π_1 ∈ Π_1} and {(I_{n_1}, π_2) : π_2 ∈ Π_2}. Each
permutation π = (π_1, π_2) ∈ Π can be represented as (π_1, π_2) = (π_1^{-1}, I_{n_2})^{-1} · (I_{n_1}, π_2), where both
(π_1^{-1}, I_{n_2}) and (I_{n_1}, π_2) are in Φ. Thus Φ is a difference cover for Π, and the size of Φ is at most
|Π_1| + |Π_2|. ∎

Lemma 9 Let Π_1 be a permutation set on n_1 objects with a difference cover Φ_1, and let Π_2
be a permutation set on n_2 objects with a difference cover Φ_2. Then the Cartesian product
Φ = Φ_1 × Φ_2 is a difference cover for Π = Π_1 × Π_2.

Proof. For each π = (π_1, π_2) ∈ Π, there exist φ_{i_1}, φ_{j_1} ∈ Φ_1 such that π_1 = φ_{j_1}^{-1}φ_{i_1}, and
there exist φ_{i_2}, φ_{j_2} ∈ Φ_2 such that π_2 = φ_{j_2}^{-1}φ_{i_2}. We then have
(π_1, π_2) = (φ_{j_1}^{-1}φ_{i_1}, φ_{j_2}^{-1}φ_{i_2}) = (φ_{j_1}, φ_{j_2})^{-1}(φ_{i_1}, φ_{i_2}),
where both (φ_{j_1}, φ_{j_2}) and (φ_{i_1}, φ_{i_2}) are in Φ = Φ_1 × Φ_2, and hence Φ is a
difference cover for Π. ∎

To demonstrate both the use of difference covers and of Lemma 9, we present in
Figure 2-2 a uniform permutation architecture due to Fiduccia [28] for realizing shifts in
a two-dimensional array. The architecture uniformly realizes the permutation set Π =
{I, N, E, S, W, NE, SE, NW, SW} of eight compass directions plus the identity I. We introduce
two permutation sets Π_1 = {I, N, S}, Π_2 = {I, E, W}, and corresponding difference covers
Φ_1 = {I, N} and Φ_2 = {I, E}. The Cartesian product Π_1 × Π_2 is Π, and the set of permutations
Φ = Φ_1 × Φ_2 = {I, E, NE, N} is a difference cover for Π.


Figure 2-2: A uniform architecture due to Fiduccia [28] based on the difference cover {I, E, NE, N} for
the permutation set Π = {I, N, E, S, W, NE, SE, NW, SW}.
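
The compass example can be checked mechanically. The sketch below is an illustration, not part of Fiduccia's design: it models each shift as a displacement vector (dx, dy), which is an assumption made here so that composition becomes vector addition and the difference φ_j^{-1}φ_i becomes vector subtraction. The helper names are hypothetical.

```python
from itertools import product

# Shifts in a two-dimensional array, modelled as displacement vectors.
I, N, E = (0, 0), (0, 1), (1, 0)


def add(u, v):
    """Compose two shifts (shifts commute, so order does not matter)."""
    return (u[0] + v[0], u[1] + v[1])


def differences(cover):
    """All shifts of the form phi_j^{-1} phi_i with phi_i, phi_j in the cover."""
    return {(a[0] - b[0], a[1] - b[1]) for a in cover for b in cover}


phi_1 = [I, N]                      # difference cover for {I, N, S}
phi_2 = [I, E]                      # difference cover for {I, E, W}
phi = [add(u, v) for u, v in product(phi_1, phi_2)]   # I, E, N, NE

# The identity plus the eight compass directions.
compass = set(product((-1, 0, 1), repeat=2))
```

Here `differences(phi)` equals `compass`, so four pins per chip realize all nine shifts, matching the product construction of Lemma 9.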


2.3.3  Substring  covers:  an  alternative  notation 

We conclude this section by defining the notion of a substring cover for a permutation set Π,
which is equivalent to the notion of a difference cover. (A similar notion for difference sets is
well known in the literature [14, 66].)

Definition 7 An ordered list Σ = (σ_0, σ_1, ..., σ_{k-1}) of permutations is a substring cover for a
permutation set Π if

1. σ_0 σ_1 ··· σ_{k-1} = I, and

2. for all π ∈ Π, there exist 0 ≤ i, j ≤ k - 1 such that π = σ_i σ_{i+1} ··· σ_j, where the arithmetic
in the indices is performed modulo k.

The substring cover Σ is a list of permutations such that all the permutations in Π can be
represented as a composition of a substring of permutations of Σ. The following two theorems
show that the notions of a substring cover and a difference cover are equivalent.

Theorem 10 Let Π be a permutation set on n elements, and let Σ be a k-element substring
cover for Π. Then Π has a difference cover Φ with at most k elements.

Proof. Given a k-element substring cover Σ = (σ_0, σ_1, ..., σ_{k-1}) for Π, a difference cover Φ
with at most k elements can be constructed. For each 0 ≤ i ≤ k - 1 we define φ_i = σ_0 σ_1 ··· σ_i.
If a permutation π can be represented as π = σ_i σ_{i+1} ··· σ_j, then π = φ_{i-1}^{-1}φ_j. By construction,
the difference cover Φ has at most k elements. ∎

Theorem 11 Let Π be a permutation set on n elements, and let Φ be a k-element difference
cover for Π. Then Π has a substring cover Σ with k elements.

Proof. Given a k-element difference cover Φ = {φ_0, φ_1, ..., φ_{k-1}} for Π, we build a substring
cover Σ for Π by defining σ_i = φ_{i-1}^{-1}φ_i for all 0 ≤ i ≤ k - 1. The product σ_0 σ_1 ··· σ_{k-1} yields
the identity permutation. For each π ∈ Π, if π = φ_i^{-1}φ_j then π = σ_{i+1} σ_{i+2} ··· σ_j. Therefore Σ
is a substring cover for Π with k elements. ∎

Referring back to the example of the eight compass directions, we present a substring
cover for the permutation set Π = {I, N, E, S, W, NE, SE, NW, SW}. The substring cover Σ =
(S, E, N, W) is constructed from the difference cover Φ = {I, E, NE, N} that was used in the
architecture of Figure 2-2. Each of the eight compass directions can be realized as a substring
of the list Σ = (S, E, N, W).
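
The conversion of Theorem 11 is easy to try on the compass example. The following sketch again assumes the displacement-vector model of shifts (an assumption of this illustration), so that σ_i = φ_{i-1}^{-1}φ_i becomes a vector difference with indices taken modulo k. The function names are hypothetical.

```python
def add(u, v):
    return (u[0] + v[0], u[1] + v[1])


def sub(u, v):
    return (u[0] - v[0], u[1] - v[1])


def substring_cover(cover):
    """sigma_i = phi_{i-1}^{-1} phi_i, indices modulo k (additive notation).
    For i = 0, cover[i - 1] is the last element, as the definition requires."""
    return [sub(cover[i], cover[i - 1]) for i in range(len(cover))]


def substrings(sigma):
    """Compositions of all nonempty cyclic substrings sigma_i ... sigma_j."""
    k = len(sigma)
    found = set()
    for i in range(k):
        total = (0, 0)
        for step in range(k):
            total = add(total, sigma[(i + step) % k])
            found.add(total)
    return found


I, N, E, S, W = (0, 0), (0, 1), (1, 0), (0, -1), (-1, 0)
NE, SE = add(N, E), add(S, E)

sigma = substring_cover([I, E, NE, N])      # ordered difference cover
```

With this ordering of the cover, `sigma` is the list (S, E, N, W), and its cyclic substrings compose to the identity and all eight compass directions.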

As another example, consider the permutation set Π = {I, N, E, S, W} of the shifts in a
2-dimensional array corresponding to the four compass directions. This permutation set has
a difference cover Φ = {I, SE, S} and a corresponding substring cover Σ = (N, SE, W).
Consequently, there is a uniform architecture for realizing the four compass directions with three
pins per chip, as has been observed by Feynman [36, pp. 437-438]. Figure 2-3 presents a
uniform architecture based on the difference cover Φ = {I, SE, S} for the permutation set
Π = {I, N, E, S, W}.

Figure 2-3: A uniform architecture due to Feynman [36] based on the difference cover {I, SE, S} for
the permutation set Π = {I, N, E, S, W}.


2.4  Cyclic  shifters 


This section describes uniform architectures for realizing cyclic shifts among n chips in one
clock tick. We first present a difference cover of size O(√n) for the set of all n cyclic shifts on
n elements, and we give an area-efficient layout for the corresponding permutation architecture
suitable for implementation as a printed-circuit board. When n can be expressed as n =
q^2 + q + 1, where q is a power of a prime, we improve the bound on the size of a difference cover
for all cyclic shifts on n elements to the optimal value of ⌈√n⌉. Finally, we prove that for any
cyclic shifter that operates in one clock tick (even a nonuniform one), the average number of
pins per chip is at least ⌈√n⌉.

2.4.1 General difference covers for cyclic shifts

The  first  permutation  architecture  for  cyclic  shifters  that  we  present  is  based  on  the  construction 
in  the  following  theorem. 

Theorem 12 The set of n cyclic shifts on n elements has a difference cover of size at most
2⌈√n⌉ - 1.

Proof. Since the set of n cyclic shifts on n elements forms a group, and since this group is
isomorphic to the group Z_n, we shall construct a difference cover D for Z_n. For convenience,
let m = ⌈√n⌉. Define two sets A = {0, 1, ..., m - 1} and B = {0, m, 2m, ..., (m - 1)m}, and
let the difference cover D be defined by D = A ∪ B. Each element s ∈ Z_n can be realized as
s = b - a (mod n), where a ∈ A and b ∈ B, by taking a = m - (s mod m) and b = ⌈s/m⌉ · m, as
can be verified. The size of the difference cover D is 2m - 1 = 2⌈√n⌉ - 1, since the element 0
occurs in both A and B. ∎
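
The construction D = A ∪ B is simple enough to check exhaustively for small n. The sketch below is an illustration only: it builds D with the elements of B reduced modulo n and verifies the cover property by brute force over all pairs, rather than through the closed-form choice of a and b given in the proof. The function names are hypothetical.

```python
from math import isqrt


def ceil_sqrt(n):
    """Smallest integer m with m * m >= n."""
    r = isqrt(n)
    return r if r * r == n else r + 1


def cyclic_difference_cover(n):
    """D = A ∪ B with A = {0, ..., m-1} and B = {0, m, ..., (m-1)m} mod n."""
    m = ceil_sqrt(n)
    a = set(range(m))
    b = {(j * m) % n for j in range(m)}
    return a | b


def covers_all_shifts(cover, n):
    """True if every s in Z_n is a difference of two elements of the cover."""
    return {(x - y) % n for x in cover for y in cover} == set(range(n))
```

For n = 16 the cover is {0, 1, 2, 3, 4, 8, 12}, which has 7 = 2⌈√16⌉ - 1 elements, in agreement with the pin count stated for Figure 2-4.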

The difference cover constructed in the proof of Theorem 12 corresponds to an architecture
with a regular, area-efficient layout, as shown in Figure 2-4. The n chips of the architecture
are laid out in an array consisting of m = √n rows, each containing √n chips. (For simplicity,
we assume that n is a square.) Each chip has pins 0, 1, ..., m - 1 on the top side, and pins
m, m + 1, ..., 2m - 1 on the left side. Each bus consists of one vertical segment and one or two
horizontal segments. Each wiring channel consists of m = √n tracks, where each track is used
to lay out segments of busses. When n is not a square, a cyclic shifter on n chips can be laid
out in a similar fashion, with each wiring channel having at most 2⌈√n⌉ tracks. The side of
the layout is therefore O(n), since there are ⌈√n⌉ chips and ⌈√n⌉ wiring channels along the
side. The area of the layout is O(n^2), which is asymptotically optimal since any architecture
that can realize any of the cyclic-shift permutations in one clock tick requires area Ω(n^2) [83,
p. 56].

Remark. The bound of 2⌈√n⌉ - 1 pins per chip can be improved to (√2 + o(1))√n, as
was observed by Mills and Wiedemann [68]. See Section 2.7.4.

Occasionally, it is desirable to implement a subset of the cyclic shifts on n elements. The
following corollary to Theorem 12 shows that when the shift amounts form an arithmetic sequence,
a small difference cover exists.

Figure 2-4: A layout for a cyclic shifter with n = 16 chips. Each chip and each bus has 7 pins. Each
bus is constructed of one vertical segment and either one or two horizontal segments.

Corollary 13 Let a, b, and p be integers modulo n. For each r ∈ [p], define π_r to be the
permutation on [n] that maps each c ∈ [n] to c + a + rb (mod n). Then the permutation set
{π_r : r ∈ [p]} has a difference cover of size 2⌈√p⌉.

Proof. As in the proof of Theorem 12, we construct two sets A and B whose union is the
desired difference cover. The sets are A = {0, b, 2b, ..., (m - 1)b} and B = {a, a + mb,
a + 2mb, ..., a + (m - 1)mb}, where m = ⌈√p⌉. ∎

2.4.2 Optimal difference covers for cyclic shifts

Returning to the problem of implementing all n cyclic shifts on n elements, the following
theorem demonstrates that for certain values of n, the optimal ⌈√n⌉ bound can be obtained.

Theorem 14 The set of n cyclic shifts on n elements has a difference cover of size ⌈√n⌉ if
n = q^2 + q + 1, where q is a power of a prime.

Proof. As in the proof of Theorem 12, the problem is equivalent to that of constructing a
difference cover D for Z_n. When n is the size of a projective plane (n = q^2 + q + 1, where q is a
power of a prime), this problem is equivalent to the problem of constructing a difference set. The
difference set we give is due to Singer; a proof of its correctness is given in Hall [43, p. 129]. Let x
be a primitive root of the Galois field GF(q^3), and let F(y) be any irreducible cubic polynomial
over the Galois field GF(q). We construct a difference cover D for Z_n from the set [n] by choosing
those i ∈ [n] such that the power x^i can be written in the form x^i = ax + b (mod F(x)) for
some a, b ∈ GF(q). ∎

The construction of a uniform architecture based on a projective plane can be interpreted
as follows. The n points of the projective plane correspond to the n chips, and the n lines of the
projective plane correspond to the n busses. Each line contains q + 1 points, which means that
each bus is connected to q + 1 chips. Each point is incident on q + 1 lines, which means that
each chip is connected to q + 1 different busses through its q + 1 pins. For example, Figure 2-1
demonstrates a uniform architecture based on the projective plane of size 13.
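
The projective-plane interpretation can be illustrated without implementing Singer's field arithmetic. The sketch below instead checks two well-known perfect difference sets, {0, 1, 3} modulo 7 (q = 2) and {0, 1, 3, 9} modulo 13 (q = 3); these sets are supplied here as assumptions of the illustration and are not necessarily the ones Singer's construction would produce for a particular choice of F. Bus b is wired to the chips b - d for d in D, following the rule that chip c reaches bus c + d through the pin labeled by d. The function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def is_perfect_difference_set(d, n):
    """True if every nonzero residue mod n arises exactly once as x - y."""
    counts = Counter((x - y) % n for x in d for y in d if x != y)
    return all(counts[s] == 1 for s in range(1, n))


def chips_on_bus(b, d, n):
    """Chips wired to bus b when chip c uses pin d to reach bus (c + d) mod n."""
    return {(b - x) % n for x in d}


def shared_busses(c1, c2, d, n):
    """Busses that are wired to both chip c1 and chip c2."""
    return [b for b in range(n) if {c1, c2} <= chips_on_bus(b, d, n)]


n, d = 13, {0, 1, 3, 9}
pair_counts = {len(shared_busses(c1, c2, d, n))
               for c1, c2 in combinations(range(n), 2)}
```

Every pair of distinct chips shares exactly one bus, which is the statement that two points of the plane lie on exactly one line, and each bus carries q + 1 = 4 chips.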

Theorems similar to Theorem 12 (but without application to architecture) appear in the
combinatorics literature: see, for example, [56]. Bus connection networks based on projective
planes have also been studied by Bermond, Bond, and Saclé [11] and by Mickunas [84], who
observed that projective planes can be used to construct hypergraphs of diameter one.

2.4.3  Lower  bound  for  cyclic  shifters 

Uniform architectures for cyclic shifters based on projective planes achieve the minimal number
of pins per chip among all uniform cyclic shifters. We now prove a lower bound of ⌈√n⌉ on
the average number of pins per chip for any permutation architecture that realizes all the cyclic
shifts. This lower bound applies to all permutation architectures, including nonuniform ones,
and shows that uniform cyclic shifters based on projective planes are optimal among all cyclic
shifters that operate in a single clock tick.

Theorem 15 Let A = (C, B, P, CHIP, BUS, LABEL) be a permutation architecture for the n
cyclic shifts on n chips. Then the average number of pins per chip is at least ⌈√n⌉.

Proof. The average number of pins per chip is |P| / n. We shall prove that |P| ≥ n⌈√n⌉, which
implies the theorem. We adopt the following conventions for notational convenience:

1. The set of busses is B = {b_0, b_1, ..., b_{m-1}}. We denote by k_i the number of pins connected
to bus b_i, that is, k_i = |{x ∈ P : BUS(x) = b_i}|.

2. The busses that have at least ⌈√n⌉ pins each are indexed first, that is, if there are r
busses with at least ⌈√n⌉ pins each, then k_i ≥ ⌈√n⌉ for i = 0, ..., r - 1 and k_i < ⌈√n⌉
for i = r, ..., m - 1.

The thrust of the proof is to count the number of distinct data transfers when the
architecture realizes each of the n - 1 nontrivial shifts in turn. (The identity permutation is a trivial
shift.) Each chip can be mapped to each other chip by one of the cyclic shifts, i.e., the cyclic
shifts form a transitive group of permutations. Considering only the n - 1 nontrivial shifts, there
are exactly n(n - 1) distinct data transfers that must be implemented through interconnections
in the architecture.

We compute an upper bound on the number of distinct data transfers that the busses can
implement. Each of the first r busses b_0, ..., b_{r-1} can be employed to realize at most one
distinct data transfer in each of the n - 1 nontrivial shifts. Thus, at most r(n - 1) distinct data
transfers can be carried out by the first r busses. Any other bus b_i, where r ≤ i ≤ m - 1, can
realize at most k_i(k_i - 1) distinct nontrivial data transfers, since it has only k_i pins connected
to it. Thus, the total number of distinct data transfers that the busses can realize is at most

\[ r(n - 1) + \sum_{i=r}^{m-1} k_i (k_i - 1), \]

which must be at least n(n - 1) if all nontrivial shifts are to be realized. Hence, we have

\[ \sum_{i=r}^{m-1} k_i (k_i - 1) \ge (n - r)(n - 1). \]

We can use this inequality to bound the number of pins on all busses with fewer than ⌈√n⌉
pins. We have k_i - 1 ≤ ⌈√n⌉ - 2 for i = r, ..., m - 1, and thus

\[ \sum_{i=r}^{m-1} k_i
   \ge \frac{1}{\lceil\sqrt{n}\rceil - 2} \sum_{i=r}^{m-1} k_i (k_i - 1)
   \ge \frac{(n - r)(n - 1)}{\lceil\sqrt{n}\rceil - 2}
   \ge (n - r) \lceil\sqrt{n}\rceil. \]

We now bound the total number of pins in the architecture from below. We have

\[ |P| = \sum_{i=0}^{m-1} k_i
       = \sum_{i=0}^{r-1} k_i + \sum_{i=r}^{m-1} k_i
       \ge r \lceil\sqrt{n}\rceil + (n - r) \lceil\sqrt{n}\rceil
       = n \lceil\sqrt{n}\rceil, \]

which proves the theorem. ∎
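
The proof bounds the pins on the small busses by passing from (n - r)(n - 1) / (⌈√n⌉ - 2) to (n - r)⌈√n⌉ without comment. A short justification, written here as a sketch with the abbreviation c = ⌈√n⌉ (a symbol introduced only for this note) and assuming c > 2 so that the division is meaningful:

```latex
\begin{align*}
c = \lceil \sqrt{n}\, \rceil
  \;&\Longrightarrow\; c - 1 < \sqrt{n}
  \;\Longrightarrow\; (c - 1)^2 < n, \\
c\,(c - 2) &= (c - 1)^2 - 1 < n - 1, \\
\frac{n - 1}{c - 2} &> c \qquad (c > 2).
\end{align*}
```

Multiplying the last line by n - r ≥ 0 gives the inequality used in the proof.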


2.5  Difference  covers  for  groups 

In this section we show that small difference covers for abelian and nonabelian permutation
groups exist. Specifically, for any abelian permutation group Π with p elements, we apply the
decomposition theorem for finite abelian groups and the results for cyclic shifters in Section 2.4,
and we show the existence of a difference cover of size O(√p), which is optimal to within a
constant factor. For a general permutation group Π with p elements, we give a greedy construction
of a difference cover with O(√(p lg p)) elements. Finkelstein, Kleitman, and Leighton [31] have
recently improved our result for general groups to O(√p).

2.5.1  Abelian  groups 

We first show that if a permutation set forms an abelian group with p permutations, then a
difference cover of size O(√p) can be constructed.

Theorem 16 For any abelian group Π with p elements, there exists a difference cover Φ of
size at most 3√p.




Proof. Assume without loss of generality that p > 1. By the decomposition theorem for finite
abelian groups [58, p. 133], any abelian group Π is isomorphic to a cross product of cyclic groups

    Π ≅ Z_{p_1} × Z_{p_2} × ··· × Z_{p_k},

where p_1 p_2 ··· p_k = p, and each p_j ≥ 2. Let i be the unique index such that p_1 p_2 ··· p_{i-1} ≤ √p
and p_{i+1} p_{i+2} ··· p_k < √p, and let m = ⌈√p / (p_1 p_2 ··· p_{i-1})⌉. Using the argument of Theorem 12,
we first construct a difference cover for Z_{p_i} from the union of two sets A_i and B_i, where |A_i| ≤ m
and |B_i| ≤ ⌊p_i / m⌋, such that each element of Z_{p_i} can be expressed in the form b - a (mod p_i)
or a - b (mod p_i), where a ∈ A_i and b ∈ B_i.

We now construct a difference cover for Π ≅ Z_{p_1} × Z_{p_2} × ··· × Z_{p_k} from the union of two
sets A and B, where

    A ≅ Z_{p_1} × Z_{p_2} × ··· × Z_{p_{i-1}} × A_i,

and

    B ≅ B_i × Z_{p_{i+1}} × Z_{p_{i+2}} × ··· × Z_{p_k}.

That A ∪ B is a difference cover for Π follows from essentially the same argument as is used in
Lemma 9. The size of the difference cover A ∪ B is at most |A| + |B|. The size of A is

\[ |A| = p_1 p_2 \cdots p_{i-1} |A_i|
       \le p_1 p_2 \cdots p_{i-1}\, m
       = p_1 p_2 \cdots p_{i-1} \left\lceil \frac{\sqrt{p}}{p_1 p_2 \cdots p_{i-1}} \right\rceil
       < \sqrt{p} + p_1 p_2 \cdots p_{i-1}
       \le 2\sqrt{p}. \]

Similarly, the size of B is

\[ |B| = |B_i|\, p_{i+1} p_{i+2} \cdots p_k
       \le \lfloor p_i / m \rfloor\, p_{i+1} p_{i+2} \cdots p_k
       \le \frac{p_i}{\left\lceil \sqrt{p} / (p_1 p_2 \cdots p_{i-1}) \right\rceil}\, p_{i+1} p_{i+2} \cdots p_k
       \le \frac{p_1 p_2 \cdots p_i}{\sqrt{p}}\, p_{i+1} p_{i+2} \cdots p_k
       = \sqrt{p}. \]

Consequently, the size of the difference cover for Π is at most 3√p.

2.5.2 General groups

The  next  theorem  gives  a  method  for  constructing  small  difference  covers  for  general  groups  of 
permutations. 

Theorem 17 Let Π be an arbitrary group with p elements. Then Π has a difference cover Φ
of size at most √(2p ln p) + 1.

Proof. We construct a difference cover incrementally starting with a partial difference cover
Φ_1 = {I}. At each step of the construction, we select an element φ_{i+1} ∈ Π such that
|(Φ_i ∪ {φ_{i+1}})^{-1}(Φ_i ∪ {φ_{i+1}})| maximizes |(Φ_i ∪ {π})^{-1}(Φ_i ∪ {π})| over all π ∈ Π. We then
define the new partial difference cover as Φ_{i+1} = Φ_i ∪ {φ_{i+1}}.

The analysis of this construction is in three parts. We first determine a lower bound on the
number of elements of Π that are not covered by the partial difference cover Φ_i but are covered
by Φ_{i+1}. We then develop a recurrence to upper bound the number of elements of the group
Π that are not covered at the ith step. Finally, we solve the recurrence to determine that the
number k of iterations needed to cover all elements in Π is at most √(2p ln p) + 1.

We first determine how many new elements of Π are covered when Φ_i is augmented with
φ_{i+1} to produce Φ_{i+1}, for i ≥ 1. Let the set Δ_i be the set of elements that are not covered by
the partial difference cover Φ_i, which can be defined as Δ_i = Π - Φ_i^{-1}Φ_i. Consider triples of
the form (φ, δ, π) such that φ ∈ Φ_i, δ ∈ Δ_i, π ∈ Π, and φδ = π. Observe that for any fixed
π ∈ Π and δ ∈ Δ_i, there is at most one triple of the form (φ, δ, π) in the set of triples, namely
when πδ^{-1} ∈ Φ_i. For a fixed π, the number of triples (φ, δ, π) in the set of triples
is a lower bound on the number of elements covered by Φ_i ∪ {π} but not by Φ_i, since we have
δ = φ^{-1}π and δ ∈ Δ_i = Π - Φ_i^{-1}Φ_i. For each φ ∈ Φ_i and δ ∈ Δ_i, there is exactly one triple
in the set of triples, and thus there are exactly |Φ_i| · |Δ_i| triples. Since there are at most |Π|
distinct permutations appearing as the third coordinate of a triple, the permutation φ_{i+1} that
appears most often must appear at least |Φ_i| · |Δ_i| / |Π| times, and hence at least this many
elements are covered by Φ_{i+1} that are not covered by Φ_i.

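We can now bound the number of elements not covered by Φ_{i+1} in terms of the number of
elements not covered by Φ_i by

\[ |\Delta_{i+1}| \le |\Delta_i| - \frac{|\Phi_i|\,|\Delta_i|}{|\Pi|}
   = |\Delta_i| \left(1 - \frac{i}{p}\right)
   \le |\Delta_1| \prod_{j=1}^{i} \left(1 - \frac{j}{p}\right)
   \le p \prod_{j=1}^{i} \left(1 - \frac{j}{p}\right). \]

When we obtain |Δ_k| < 1 for some k, the partial difference cover Φ_k is a difference cover for Π
because Δ_k is empty. Thus, Φ_k is a difference cover when

\[ p \prod_{i=1}^{k-1} \left(1 - \frac{i}{p}\right) < 1, \]

or equivalently, when

\[ \ln p + \sum_{i=1}^{k-1} \ln\left(1 - \frac{i}{p}\right) < 0. \]

Using the inequality ln(1 + x) ≤ x, we have

\[ \ln p + \sum_{i=1}^{k-1} \ln\left(1 - \frac{i}{p}\right)
   \le \ln p - \sum_{i=1}^{k-1} \frac{i}{p}
   < \ln p - \frac{(k-1)^2}{2p}. \]

Thus, Φ_k is a difference cover when k ≥ √(2p ln p) + 1. ∎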

This proof of Theorem 17 provides a construction which can be implemented as a
deterministic, polynomial-time algorithm with O(p^2 lg p) algebraic steps. We could also have proved
the theorem by relying on the result of Babai and Erdős [4] that any group has a small set of
generators, but this method would have produced only an existential (nonconstructive) result.
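
A direct, unoptimized rendering of the greedy construction is sketched below. It is an illustration under stated assumptions, not the algorithm with the step count quoted above: permutations are stored as tuples, each round recomputes all differences from scratch, and ties are broken by taking the first maximizer. The function names are hypothetical, and the symmetric group on four points is used only as a small nonabelian test case.

```python
from itertools import permutations
from math import log, sqrt


def compose(f, g):
    """Return the permutation c -> f[g[c]]."""
    return tuple(f[x] for x in g)


def inverse(f):
    inv = [0] * len(f)
    for x, y in enumerate(f):
        inv[y] = x
    return tuple(inv)


def differences(cover):
    """All products a^{-1} b with a, b in the cover."""
    return {compose(inverse(a), b) for a in cover for b in cover}


def greedy_difference_cover(group):
    """Start from {I}; repeatedly add the element covering the most of the group."""
    target = set(group)
    cover = [tuple(range(len(group[0])))]
    while not target <= differences(cover):
        best = max(group, key=lambda g: len(differences(cover + [g]) & target))
        cover.append(best)
    return cover


s4 = list(permutations(range(4)))        # a group with p = 24 elements
cover = greedy_difference_cover(s4)
```

Each round covers at least one new element (adding an uncovered δ to a cover containing the identity covers δ itself), so the loop terminates, and the resulting cover for this group stays within the √(2p ln p) + 1 bound of Theorem 17.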

Finkelstein, Kleitman, and Leighton [31] have recently improved our result for general groups
to O(√p). Their proof uses a folk theorem [25] that every simple group of nonprime order p
has a subgroup of size at least √p. The folk theorem is proved by checking each type of group
in the classification theorem [39, pp. 135-136].


2.6 Multiple clock ticks

In this section, we discuss uniform permutation architectures that realize permutations in
several clock ticks. By using more than one clock tick, further savings in the number of pins per
chip can be obtained. We first generalize the notion of a difference cover to handle multiple
clock ticks. We then describe a cyclic shifter on n chips with only O(n^{1/(2t)}) pins per chip that
operates in t ticks.

2.6.1  The  notion  of  a  t-difference  cover 

We  first  generalize  the  notion  of  a  difference  cover  to  handle  realization  of  permutations  in  t  >  1 
clock  ticks. 

Definition 8 A t-difference cover for a permutation set $\Pi$ is a set $\Phi$ of permutations such that

$$(\Phi^{-1}\Phi)^t \supseteq \Pi.$$

Using a t-difference cover $\Phi$ for the permutation set $\Pi$, any permutation $\pi \in \Pi$ can be
expressed as the composition of $t$ differences of permutations from $\Phi$. The next lemma relates
t-difference covers to permutation architectures that uniformly realize permutations in $t$ clock
ticks, for general values of $t$.

Lemma 18 Let $\Phi$ be a t-difference cover with $k$ elements for a permutation set $\Pi$. Then there
is a permutation architecture with $k$ pins per chip that uniformly realizes $\Pi$ in $t$ clock ticks.

Proof. We define the permutation set $\Sigma = \Phi^{-1}\Phi$. Let $A$ = (C, B, P, CHIP, BUS, LABEL) be the
permutation architecture, based on the difference cover $\Phi$, that uniformly realizes $\Sigma$. Hence, the
permutation architecture $A$ can uniformly realize any $\sigma \in \Sigma$ in one clock tick. Each permutation
$\pi \in \Pi$ can be expressed as $\pi = \sigma_{t-1}\sigma_{t-2} \cdots \sigma_0$, where $\sigma_i \in \Sigma$ for $0 \le i \le t-1$, since we have
$\Sigma^t = (\Phi^{-1}\Phi)^t \supseteq \Pi$. In order to realize $\pi$ in $t$ clock ticks, the permutation architecture $A$
uniformly realizes $\sigma_i$ in clock tick $i$ for $0 \le i \le t-1$. ∎
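The definition can be checked mechanically for small permutation sets. The sketch below is illustrative only: permutations are represented as tuples, a "difference" of f and g is taken to be f^{-1} composed with g (an assumed convention), and all function names are hypothetical.

```python
def compose(a, b):
    """Return the permutation a∘b, that is, x -> a[b[x]]."""
    return tuple(a[x] for x in b)

def inverse(a):
    """Return the inverse permutation of a."""
    inv = [0] * len(a)
    for x, y in enumerate(a):
        inv[y] = x
    return tuple(inv)

def differences(phi):
    """All differences f^{-1}∘g for f, g in phi."""
    return {compose(inverse(f), g) for f in phi for g in phi}

def is_t_difference_cover(phi, targets, t):
    """Check whether every target is a composition of t differences of phi."""
    sigma = differences(phi)
    reachable = set(sigma)
    for _ in range(t - 1):
        reachable = {compose(a, b) for a in sigma for b in reachable}
    return set(targets) <= reachable

def shift(n, s):
    """Cyclic shift by s on n objects."""
    return tuple((c + s) % n for c in range(n))

if __name__ == "__main__":
    n = 9
    phi = [shift(n, s) for s in (0, 1, 3)]
    all_shifts = [shift(n, s) for s in range(n)]
    print(is_t_difference_cover(phi, all_shifts, 1))
    print(is_t_difference_cover(phi, all_shifts, 2))
```

In this small case, three shifts do not cover all nine cyclic shifts in one tick, but their differences composed twice do, which is the kind of saving the section is after.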

2.6.2  Constructing  t-difference  covers  for  cyclic  shifters 

Lemma 18 claims that the problem of uniformly realizing a permutation set $\Pi$ in $t$ clock ticks
can be reduced to finding a permutation set $\Sigma$ such that $\Sigma^t \supseteq \Pi$, and then finding a difference
cover for $\Sigma$. The great advantage of using more than one clock tick is in the further savings
in the number of pins per chip. The following theorem, for example, describes a construction
of a t-difference cover of size $O(n^{1/2t})$ for the set of cyclic shifts on $n$ objects. This result can
be used to build a uniform architecture on $n$ chips with only $O(n^{1/2t})$ pins per chip that can
realize any cyclic shift on the $n$ chips in $t$ clock ticks.

Theorem 19 For any $n \ge 1$ and $t \ge 1$, the permutation set of all the $n$ cyclic shifts on $n$
objects has a t-difference cover of size $O(n^{1/2t})$.

Proof. For the purpose of the proof, we denote the permutation set of all the $n$ cyclic shifts
on $n$ objects by $\Pi_n$. (Recall that $\Pi_n \cong Z_n$.) We first treat the case for those $n$ such that
there exists an integer $m$ satisfying $n^{1/t} \le m \le 4n^{1/t}$ and $\gcd(m, n) = 1$. We then use this case
to extend the proof to all values of $n$.

Since $\gcd(m, n) = 1$, there exists an $m^{-1} \in Z_n$ such that $m \cdot m^{-1} \equiv 1 \pmod{n}$. For each
$r \in [m]$, define the permutation $\sigma_r : [n] \to [n]$ as $\sigma_r(c) = m^{-1}(c + r) \pmod{n}$, and define the
permutation $\sigma'_r : [n] \to [n]$ as $\sigma'_r(c) = m^{t-1}(c + r) \pmod{n}$. Next define the permutation set
$\Sigma = \{\sigma_r\} \cup \{\sigma'_r\}$. The set $\{\sigma_r\}$ is an arithmetic sequence of cyclic shifts on $n$ elements (as in
Corollary 13) followed by the fixed permutation corresponding to multiplication by $m^{-1}$, and
thus $\{\sigma_r\}$ has a difference cover of size $O(\sqrt{m})$. Similarly, the set $\{\sigma'_r\}$ has a difference cover of
size $O(\sqrt{m})$. Combining the two difference covers for $\{\sigma_r\}$ and $\{\sigma'_r\}$, we get a difference cover
$\Phi$ of size $O(\sqrt{m}) = O(n^{1/2t})$ for $\Sigma$.

We now show the inclusion $\Sigma^t \supseteq \Pi_n$. Let $\pi \in \Pi_n$ be the cyclic shift by $s$.
We express the shift amount $s \in [n]$ as $s = s_0 + s_1 m + \cdots + s_{t-1} m^{t-1}$, where $s_i \in [m]$ for
$0 \le i \le t-1$. The permutation $\pi$ can be described as

$$\begin{aligned}
\pi(c) &= c + s \pmod{n} \\
       &= c + s_0 + s_1 m + \cdots + s_{t-1} m^{t-1} \pmod{n} \\
       &= m^{t-1}\left(s_{t-1} + m^{-1}\left(s_{t-2} + \cdots + m^{-1}(s_0 + c)\right)\right) \pmod{n},
\end{aligned}$$

which proves that $\pi \in \Sigma^t$. Hence, we get the inclusion $\Sigma^t \supseteq \Pi_n$, which together with the fact
that there is a difference cover $\Phi$ of size $O(n^{1/2t})$ for $\Sigma$, proves the theorem for the case when
there exists an integer $m$ satisfying $n^{1/t} \le m \le 4n^{1/t}$ and $\gcd(m, n) = 1$.




Such an $m$ need not exist for every $n$ and every $t$, however. We can overcome this difficulty
by factoring $n = n_1 n_2$ such that $n_1$ consists of no even-indexed primes (3, 7, 13, ...) and $n_2$
consists of no odd-indexed primes (2, 5, 11, ...). Since we have $\gcd(n_1, n_2) = 1$, we can use the
Chinese remainder theorem to express $Z_n$ as a Cartesian product $Z_{n_1} \times Z_{n_2}$. We let $m_1$
be the first even-indexed prime at least as large as $n_1^{1/t}$, and let $m_2$ be the first odd-indexed
prime at least as large as $n_2^{1/t}$. Bertrand's postulate [44, p. 343] guarantees that for every $x$,
there is a prime between $x$ and $2x$, which means $m_j \in [n_j^{1/t}, 4n_j^{1/t}]$ for $j = 1, 2$. (Tighter bounds
are possible.)

We can now use the previous construction to construct a t-difference cover $\Phi_1$ of size $O(n_1^{1/2t})$
for $Z_{n_1}$, which is isomorphic to $\Pi_{n_1}$, and a t-difference cover $\Phi_2$ of size $O(n_2^{1/2t})$ for $Z_{n_2}$, which
is isomorphic to $\Pi_{n_2}$. Using the same technique as in the proof of Lemma 9, we can construct
a t-difference cover of size $O(n_1^{1/2t}) \cdot O(n_2^{1/2t}) = O(n^{1/2t})$ for $Z_{n_1} \times Z_{n_2} \cong Z_n \cong \Pi_n$. ∎
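The key identity in the first case of the proof, that a cyclic shift by $s$ factors into $t$ maps of the form $c \mapsto m^{-1}(c + r)$ with a final map $c \mapsto m^{t-1}(c + r)$, can be checked numerically. The following is a minimal sketch under stated assumptions: the function names are illustrative, $m$ is chosen by hand so that $\gcd(m, n) = 1$ and $m^t \ge n$, and the modular inverse uses the three-argument `pow` with a negative exponent (Python 3.8 or later). It verifies only the inclusion of the shifts in the $t$-fold composition, not the size of the difference cover.

```python
from math import gcd

def shift_as_composition(n, m, t, s):
    """Return t maps on Z_n (as tuples) whose composition is the shift by s.

    The first t - 1 maps are c -> m^{-1} (c + s_i); the last is
    c -> m^{t-1} (c + s_{t-1}), where s_0, ..., s_{t-1} are the base-m digits of s.
    """
    assert gcd(m, n) == 1 and m ** t >= n and 0 <= s < n
    m_inv = pow(m, -1, n)                            # inverse of m modulo n
    digits = [(s // m ** i) % m for i in range(t)]   # s = sum of digits[i] * m**i
    maps = []
    for i, r in enumerate(digits):
        factor = m_inv if i < t - 1 else pow(m, t - 1, n)
        maps.append(tuple(factor * (c + r) % n for c in range(n)))
    return maps

def apply_in_order(maps, c):
    """Apply the maps one clock tick at a time, first map first."""
    for sigma in maps:
        c = sigma[c]
    return c

if __name__ == "__main__":
    n, m, t = 10, 7, 2
    for s in range(n):
        maps = shift_as_composition(n, m, t, s)
        assert all(apply_in_order(maps, c) == (c + s) % n for c in range(n))
    print("all shifts on", n, "objects factor into", t, "maps")
```

Each map is a permutation of $[n]$ because multiplication by a unit of $Z_n$ is invertible, so the factorization stays within the permutation set used in the proof.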

One can rather straightforwardly use Corollary 13 to obtain a t-difference cover of size
$O(t n^{1/2t})$. Based on the representation of the shift amount $s = s_0 + s_1 m + \cdots + s_{t-1} m^{t-1}$, one
can come up with $t$ separate difference covers, each of size $O(n^{1/2t})$, for the $t$ separate sequences
of arithmetic shifts by $\{a m^i : a \in [m]\}$ for $0 \le i \le t-1$. Theorem 19 avoids the extra factor
of $t$ by constructing only one such difference cover and using its elements for each one of the $t$
differences.


2.7  Applications  and  extensions 

This section contains some additional results on permutation architectures and difference covers.
We describe efficient uniform architectures that can realize the permutations implemented
by various popular interconnection networks, including multidimensional meshes, hypercubes,
and shuffle-exchange networks. We extend the lower bound technique of Section 2.4.3 to general
permutation sets. We examine nonuniform permutation architectures, and adapt some combinatorial
results in the literature to apply to permutation architectures. Finally, we describe
directions for further research and some related work brought on by an earlier version [48] of
this research.
2.7.1 More networks

By  using  busses,  many  popular  interconnection  networks  can  be  realized  with  fewer  pins  than 
conventionally  proposed.  Here,  we  mention  a  few. 

The permutation architectures for realizing compass shifts on two-dimensional arrays can
be extended in a natural fashion to d-dimensional arrays. For the d-dimensional analogue
of the shifts {I, N, E, S, W}, there is a uniform architecture that uses only $d + 1$ pins per
chip to implement the $2d + 1$ permutations. For the d-dimensional analogue of the shifts
{I, N, E, S, W, NE, SE, NW, SW}, there is a uniform architecture with only $2^d$ pins per chip that
implements all $3^d$ shifts. (These results were independently obtained by Fiduccia [29, 30].)
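One way to see why these pin counts are plausible is to model each shift as an integer vector and a "difference" of two shifts as vector subtraction. The sketch below checks two candidate covers for small $d$. The covers themselves (the identity plus one unit shift per axis, and all 0/1 vectors) are an assumption made for illustration and are not taken from the text; the function names are hypothetical.

```python
from itertools import product

def vector_differences(cover):
    """All pairwise differences g - f of the shift vectors in the cover."""
    return {tuple(b - a for a, b in zip(f, g)) for f in cover for g in cover}

def axis_cover(d):
    """Identity plus a unit shift along each axis: d + 1 shift vectors."""
    zero = (0,) * d
    units = [tuple(1 if i == j else 0 for i in range(d)) for j in range(d)]
    return [zero] + units

def corner_cover(d):
    """All shift vectors with entries in {0, 1}: 2**d of them."""
    return list(product((0, 1), repeat=d))

def axis_shifts(d):
    """The 2d + 1 shifts: identity and one step forward or back along each axis."""
    shifts = {(0,) * d}
    for j in range(d):
        for step in (1, -1):
            shifts.add(tuple(step if i == j else 0 for i in range(d)))
    return shifts

if __name__ == "__main__":
    d = 3
    print(axis_shifts(d) <= vector_differences(axis_cover(d)))
    print(set(product((-1, 0, 1), repeat=d)) == vector_differences(corner_cover(d)))
```

Under this model, $d + 1$ vectors cover the $2d + 1$ axis shifts and $2^d$ vectors cover all $3^d$ shifts with entries in $\{-1, 0, 1\}$.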

A Boolean hypercube of dimension $d$ is a degenerate case of a d-dimensional array. Only
$d + 1$ pins per chip are required by a permutation architecture that uses busses, whereas $2d$ pins
per chip are needed if point-to-point wires are used. (To realize a swap of information across
a dimension in one clock tick, each chip requires two pins for that dimension: one to read and
one to write.) In the case of the d-dimensional hypercube, the
permutation set consists of the $d$ permutations that swap data across each of the $d$ dimensions.
For this case, Fiduccia [30] shows that $d + 1$ pins per chip is the least possible.

A permutation architecture that implements the permutations Shuffle, Inverse Shuffle, and
Exchange can be constructed with three pins per chip instead of the usual four. This can be
done by taking the set of three permutations Identity, Shuffle, and Exchange, which forms
a difference cover for the desired permutation set. Furthermore, the same architecture also
implements the Shuffle-Exchange and Inverse Shuffle-Exchange permutations in one clock tick.
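The claim about the three-element cover can be checked directly on a small network. The sketch below is illustrative: it assumes the perfect shuffle is a left rotation of the $d$-bit index, the exchange flips the lowest bit, and a difference of f and g means f^{-1} composed with g. These conventions and the function names are assumptions, not definitions from the text.

```python
def compose(a, b):
    """Return the permutation a∘b, that is, x -> a[b[x]]."""
    return tuple(a[x] for x in b)

def inverse(a):
    """Return the inverse permutation of a."""
    inv = [0] * len(a)
    for x, y in enumerate(a):
        inv[y] = x
    return tuple(inv)

def identity(d):
    return tuple(range(1 << d))

def shuffle(d):
    """Perfect shuffle: rotate the d-bit index left by one position."""
    n = 1 << d
    return tuple(((x << 1) | (x >> (d - 1))) & (n - 1) for x in range(n))

def exchange(d):
    """Exchange: flip the least significant bit of the index."""
    return tuple(x ^ 1 for x in range(1 << d))

def differences(phi):
    """All differences f^{-1}∘g for f, g in phi."""
    return {compose(inverse(f), g) for f in phi for g in phi}

if __name__ == "__main__":
    d = 3
    S, E = shuffle(d), exchange(d)
    diffs = differences([identity(d), S, E])
    wanted = [S, inverse(S), E, compose(E, S), compose(inverse(S), E)]
    print(all(w in diffs for w in wanted))
```

Because the exchange is its own inverse, the difference of Exchange and Shuffle is the composite shuffle-then-exchange, and the difference in the other order is its inverse.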

2.7.2  Average  number  of  pins  per  chip 

Theorem  15  presents  a  lower  bound  on  the  average  number  of  pins  per  chip  in  any  cyclic  shifter 
that  operates  in  one  clock  tick.  The  following  theorem  is  a  natural  extension  of  Theorem  15  for 
a  general  set  of  permutations. 

Theorem 20 Let $\Pi$ be a permutation set on $n$ objects with $p$ permutations and with a total
of $T$ nontrivial data transfers, and let $A$ = (C, B, P, CHIP, BUS, LABEL) be any permutation
architecture for realizing $\Pi$. Then the average number of pins per chip is at least $T/(n\sqrt{p})$.

Proof. As in the proof of Theorem 15, we prove that $|P| \ge T/\sqrt{p}$, which implies the theorem.
We make similar notational conventions:

1. The set of busses is $B = \{b_0, b_1, \ldots, b_{m-1}\}$. We denote by $k_i$ the number of pins connected
to bus $b_i$.

2. The $r$ busses that have at least $\sqrt{p}$ pins each are indexed first, that is, $k_i \ge \sqrt{p}$ for
$i = 0, \ldots, r-1$ and $k_i < \sqrt{p}$ for $i = r, \ldots, m-1$.

We count the number of distinct data transfers that can be accomplished by each bus. Each
of the first $r$ busses can be employed to realize at most $p$ out of the $T$ nontrivial data transfers,
since it can be used at most once for each of the $p$ permutations. Any other bus $b_i$, where
$r \le i \le m-1$, can realize at most $k_i(k_i - 1)$ out of the $T$ nontrivial data transfers, since it has
only $k_i$ pins connected to it. We need to have $\sum_{i=r}^{m-1} k_i(k_i - 1) \ge T - rp$, which, since
$k_i - 1 < \sqrt{p}$ for these busses, implies

$$\sum_{i=r}^{m-1} k_i \ge \frac{T - rp}{\sqrt{p}} = \frac{T}{\sqrt{p}} - r\sqrt{p}.$$

The number of pins in the architecture can now be bounded as follows:

$$\begin{aligned}
|P| &= \sum_{i=0}^{m-1} k_i \\
    &= \sum_{i=0}^{r-1} k_i + \sum_{i=r}^{m-1} k_i \\
    &\ge r\sqrt{p} + \left(\frac{T}{\sqrt{p}} - r\sqrt{p}\right) \\
    &= \frac{T}{\sqrt{p}}.
\end{aligned}$$ ∎

Theorem  20  demonstrates  that  uniform  architectures  can  achieve  the  optimal  number  (to 
within  a  constant  factor)  of  pins  per  chip  for  certain  classes  of  permutation  sets.  When  there 
are  relatively  few  permutations  that  are  responsible  for  many  nontrivial  data  transfers,  the 
average  number  of  pins  per  chip  is  high.  The  set  of  cyclic  shifts  is  an  example  of  this  kind  of 
permutation  set. 
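To make the cyclic-shift example concrete, the bound of Theorem 20 can be evaluated numerically. The sketch below assumes that a data transfer is nontrivial exactly when an item moves to a different chip, so that the $n$ cyclic shifts on $n$ objects contribute $T = n(n-1)$ transfers; the function names are illustrative.

```python
import math

def average_pin_lower_bound(n: int, p: int, T: int) -> float:
    """Lower bound T / (n * sqrt(p)) on the average number of pins per chip."""
    return T / (n * math.sqrt(p))

def cyclic_shift_parameters(n: int):
    """Return (p, T) for the set of all n cyclic shifts on n objects.

    Each of the n - 1 nonzero shifts moves all n items; the identity moves none.
    """
    return n, n * (n - 1)

if __name__ == "__main__":
    for n in (16, 64, 256):
        p, T = cyclic_shift_parameters(n)
        print(n, average_pin_lower_bound(n, p, T), math.sqrt(n))
```

Under this counting the bound equals $(n-1)/\sqrt{n}$, which is within a constant factor of the $O(\sqrt{n})$ pins per chip used by the cyclic shifters built from difference covers.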
2.7.3 Nonuniform architectures

When  the  uniformity  condition  on  permutation  architectures  is  dropped,  one  can  do  much  better 
in  terms  of  the  number  of  pins  per  chip.  The  complexity  of  control  may  increase  substantially, 
however,  due  to  the  irregular  communication  patterns  and  the  number  of  possible  permutations 
realizable  for  some  of  the  architectures.  Nevertheless,  from  a  mathematical  point  of  view, 
nonuniform  architectures  are  quite  interesting. 

In fact, nonuniform architectures have been studied quite extensively in the mathematics
literature in the guise of partitioning problems. For the problem of realizing all $n!$ permutations
on $n$ chips, a result due to de Bruijn, Erdős, and Spencer [84, pp. 106-108] implies
that $O(\sqrt{n \lg n})$ pins per chip suffice. The nonuniform architecture that achieves this bound is
constructed probabilistically, however. It is an open problem to obtain this bound deterministically.
The best deterministic construction to date is due to Feldman, Friedman, and Pippenger
[26] and uses $O(n^{2/3})$ pins per chip.

2.7.4  Further  research 

We  list  a  few  of  the  problems  that  have  been  left  open  by  our  research.  We  also  describe  briefly 
some  further  work  brought  on  by  an  earlier  version  [48]  of  this  research. 

In Section 2.4 we described a difference cover of size $2\lceil\sqrt{n}\rceil - 1$ for the cyclic group $Z_n$,
and proved that when $n$ is the order of a projective plane, there is a difference cover of size
$\lceil\sqrt{n}\rceil$. It seems reasonable that any cyclic group $Z_n$ might actually have a difference cover of
size $\sqrt{n} + o(\sqrt{n})$, but we have been unable to prove or disprove this conjecture. Mills and
Wiedemann [67] have computed a table of minimal difference covers for all the cyclic groups of
cardinality up to 110. For any value of $n$ up to 110, the difference cover they find has at most
$\lceil\sqrt{n}\rceil + 2$ elements. In [68], they provide a "folk theorem" that establishes a stronger upper
bound for the general case than $2\lceil\sqrt{n}\rceil - 1$.

Theorem 21 The set of $n$ cyclic shifts on $n$ elements has a difference cover of size
$(\sqrt{2} + o(1))\sqrt{n}$.

Sketch of proof. [68] Let $q$ be the smallest prime such that $l = q^2 + q + 1 > n/2$. We have
$q \approx (1 + o(1))\sqrt{n/2}$, since for large $x$, there exists a prime between $x$ and $x + o(x)$. Let
$\{d_0, d_1, \ldots, d_q\}$ be a difference cover for $Z_l$ chosen as in Theorem 14. It can be verified that the
set $\{d_0, d_1, \ldots, d_q\} \cup \{d_0 + l, d_1 + l, \ldots, d_q + l\}$ forms a difference cover for $Z_n$. ∎
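The "it can be verified" step can be spot-checked by brute force for small cases. The sketch below is only an illustration: the three perfect difference sets (for $l = 7, 13, 31$, that is $q = 2, 3, 5$) are hard-coded examples rather than the output of the construction in Theorem 14, and the function names are hypothetical.

```python
def is_difference_cover(elements, n):
    """True if every residue modulo n is a difference of two of the elements."""
    residues = {(a - b) % n for a in elements for b in elements}
    return len(residues) == n

def doubled_cover(base, l):
    """The set {d} ∪ {d + l} for d in base, as used in the proof sketch."""
    return sorted(set(base) | {d + l for d in base})

# Example difference covers of Z_l with q + 1 elements, l = q*q + q + 1.
EXAMPLE_SETS = {
    7: (0, 1, 3),
    13: (0, 1, 3, 9),
    31: (0, 1, 3, 8, 12, 18),
}

if __name__ == "__main__":
    for l, base in EXAMPLE_SETS.items():
        assert is_difference_cover(base, l)
        ok = all(is_difference_cover(doubled_cover(base, l), n)
                 for n in range(l, 2 * l))
        print(l, len(doubled_cover(base, l)), ok)
```

The doubled set has $2(q + 1)$ elements, which matches the $(\sqrt{2} + o(1))\sqrt{n}$ size claimed when $l$ is just above $n/2$.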

Another interesting problem related to cyclic shifters involves finding an area-efficient VLSI
layout of the cyclic shifter based on projective planes. In Section 2.4 we presented an area-efficient
layout using a difference cover whose size is twice the optimal size. Is there a good
layout for the pin-optimal design?

To implement cyclic shifters that operate in $t$ clock ticks, we showed how to construct a
t-difference cover for $Z_n$ of size $O(n^{1/2t})$. A simpler construction achieves the bound $O(t n^{1/2t})$.
Theorem 15 gives a lower bound of $\Omega(\sqrt{n})$ on the average number of pins per chip for a cyclic
shifter that operates in one clock tick. It may be possible to prove a lower bound of $\Omega(n^{1/2t})$ on
the average number of pins per chip when an architecture operates in $t$ clock ticks, but we were
unable to extend the argument. We were also unable to extend either of these constructions
to give good t-difference covers for groups, either general or abelian. It would be interesting
to know whether a general (or an abelian) group of permutations with $p$ permutations has a
t-difference cover of size $O(t p^{1/2t})$, for any $t \ge 1$.

We have concentrated primarily on permutation sets that have good structure, specifically
group properties. In general, when the permutation set has no known structure, the best possible
upper bound is given by Theorem 7 of Section 2.3.2. It would be interesting to identify other
structural properties of permutation sets besides group properties that allow small difference
covers to exist.
Chapter 3


Priority Arbitration with Busses


This chapter explores how busses can be used to efficiently implement arbitration mechanisms.
We investigate priority arbitration schemes that use busses to arbitrate among $n$ modules in
a digital system. We focus on distributed mechanisms that employ $m$ arbitration busses, for
$\lg n \le m \le n$, and use asynchronous combinational arbitration logic. A widely used distributed
asynchronous mechanism is the binary arbitration scheme, which with $m = \lg n$ busses arbitrates
in $t = \lg n$ units of time. We present a new asynchronous scheme, binomial arbitration,
that by using $m = \lg n + 1$ busses reduces the arbitration time to $t = \frac{1}{2} \lg n$. Extending this
result, we present the generalized binomial arbitration scheme that achieves a bus-time tradeoff
of the form $m = \Theta(t n^{1/t})$ between the number of arbitration busses $m$ and the arbitration time
$t$ (in units of bus-settling delay), for values of $\lg n \le m \le n$ and $1 \le t \le \lg n$. Our schemes
are based on a novel analysis of data-dependent delays and generalize the two known schemes:
linear arbitration, which with $m = n$ busses achieves $t = 1$ time, and binary arbitration, which
with $m = \lg n$ busses achieves $t = \lg n$ time. Most importantly, our schemes can be adopted
with no changes to existing hardware and protocols; they merely involve selecting a good set
of priority arbitration codewords. We also investigate the capabilities of general asynchronous
priority arbitration schemes that employ busses and present some lower bound arguments that
demonstrate the efficiency of our schemes.


This chapter describes research that appeared partially in [50] and [51].


3.1  Introduction 

In  many  electronic  systems  there  are  situations  where  several  modules  wish  to  use  a  common 
resource  simultaneously.  Examples  include  microprocessor  systems  where  a  decision  is  required 
concerning  which  of  several  interrupts  to  service  first,  multiprocessor  environments  where  several 
processors  wish  to  use  some  device  concurrently,  and  data  communication  networks  with  shared 
media.  To  resolve  conflicts,  an  arbitration  mechanism  is  required  that  grants  the  resource  to 
one  module  at  a  time. 

Numerous  arbitration  mechanisms  have  been  developed,  including  daisy  chains,  priority 
circuits,  polling,  token  passing,  and  carrier  sense  protocols,  to  name  a  few  (see  [12,  16,  22, 
40,  57,  61,  78,  82]).  In  this  chapter  we  focus  on  distributed  priority  arbitration  mechanisms, 
where  contention  is  resolved  using  predetermined  module  priorities  and  arbitration  processes  are 
carried  out  in  a  distributed  manner  by  participating  system  modules.  In  many  modern  systems, 
and  especially  in  multiprocessor  environments  and  data  communication  networks,  distributed 
priority  arbitration  is  the  preferred  mechanism. 

Many distributed arbitration mechanisms employ a collection of arbitration busses to implement
priority arbitration. To this end, each module is assigned a unique arbitration priority,
which is an encoding of its name. An arbitration protocol determines the logic values that a
contending module applies to the busses, based on the module's arbitration priority and on logic
values on the busses. After some delay, the settled logic values on the busses uniquely identify
the contending module with the highest priority. In particular, the asynchronous binary
arbitration scheme, developed by Taub [79], gained popularity and is used in many modern
bus systems, such as Futurebus [17, 81], M3-bus [21], S-100 bus [35, 80], Multibus II [40],
Fastbus [41], and Nubus [89]. Other priority arbitration mechanisms that employ busses are
described in [12, 16, 22, 24, 47, 57, 61, 78, 82].

The asynchronous binary arbitration scheme arbitrates among $n$ modules in $t = \lg n$ units
of time, using $m = \lg n$ wired-OR (open-collector) arbitration busses.¹ The technology of open-collector
busses is such that the default logic value on a bus is 0, unless at least one module
applies a 1 to it, in which case it becomes a 1. Open-collector busses, thus, OR together the logic
values applied to them, with some time delay called bus-settling delay. In asynchronous binary
arbitration, each module is assigned a unique $(\lg n)$-bit arbitration priority. When arbitration
begins, competing modules apply their arbitration priorities to the $m = \lg n$ busses, each bit
on a separate bus; the result is the bitwise OR of their arbitration priorities. As arbitration
progresses, each competing module monitors the busses and disables its drivers according to
the following rule: if the module is applying a 0 (that is, not applying a 1) to a particular bus
but detects that the bus is carrying a 1 (applied by some other module), it ceases to apply all
its bits of lower significance. Disabled bits are re-enabled should the condition cease to hold.
The effect of this rule is that the arbitration proceeds in at most $\lg n$ stages from the most
significant bit to the least significant bit. Each stage consists of resolving another bit of the
highest competing binary priority, which leads to a worst-case arbitration time of $t = \lg n$ (in
units of bus-settling delay).

¹ Throughout this chapter we count only arbitration busses that are used for encoding the priorities. Several
additional control busses are used by all schemes and are therefore not counted.

For example, consider a system of $n = 16$ modules that uses $m = \lg 16 = 4$ arbitration
busses, with the 16 arbitration priorities consisting of all the 4-bit codewords {0000, 0001,
0010, 0011, 0100, 0101, 0110, 0111, 1000, 1001, 1010, 1011, 1100, 1101, 1110, 1111}. Figure
3-1 outlines an asynchronous binary arbitration process among four such modules $c_2$, $c_5$, $c_9$,
and $c_{10}$, with corresponding arbitration priorities 0010, 0101, 1001, and 1010. The arbitration
process begins by the competing modules applying their arbitration priorities to the busses.
The open-collector busses, therefore, compute a bitwise OR of the four arbitration priorities.
After one unit of bus-settling delay (stage 1), bus $b_3$ settles to the logic value 1, where it will
remain for the duration of the arbitration. By the above rule, each of modules $c_2$ and $c_5$ disables
its last three bits because they each apply a logic 0 to bus $b_3$ that now carries a logic 1. In the
meantime, however, each of modules $c_9$ and $c_{10}$ disables its last two bits, because of the logic
1 they detect on bus $b_2$. At the end of stage 2, therefore, bus $b_2$ settles to the logic value 0,
where it will remain for the rest of the process. As a result, modules $c_9$ and $c_{10}$ now re-enable
their two low-order bits (stage 3), because the conflict they previously detected on bus $b_2$ has
disappeared, which results in bus $b_1$ settling to a logic 1 at the end of stage 3. Finally, in stage
4, module $c_9$ ceases to apply its last bit, because of the logic value 1 it now detects on bus
$b_1$, which results in bus $b_0$ settling to a logic 0 at the end of stage 4. This arbitration process
required $t = \lg 16 = 4$ stages to complete.
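The staged behaviour in this example can be reproduced with a small simulation. The sketch below is a model under stated assumptions, not a description of any particular bus implementation: time advances in whole units of bus-settling delay, all busses start at logic 0, every module re-evaluates its drivers once per unit, and the function names are illustrative.

```python
def applied_bits(priority: int, bus: int, m: int) -> int:
    """Bits a competing module drives, given the bus values it currently sees.

    Scanning from the most significant bus down, the module applies its own
    bits until it finds a bus that carries a 1 where its own bit is 0; from
    that point on, all bits of lower significance are disabled.
    """
    out = 0
    for j in range(m - 1, -1, -1):
        bit = (priority >> j) & 1
        if bit == 0 and (bus >> j) & 1:
            break
        out |= bit << j
    return out

def arbitrate(priorities, m: int):
    """Run the wired-OR arbitration; return (settled bus value, number of stages)."""
    bus, stages = 0, 0
    while True:
        new_bus = 0
        for p in priorities:
            new_bus |= applied_bits(p, bus, m)   # wired-OR of all drivers
        if new_bus == bus:                        # nothing changes any more
            return bus, stages
        bus, stages = new_bus, stages + 1

if __name__ == "__main__":
    winner, stages = arbitrate([0b0010, 0b0101, 0b1001, 0b1010], m=4)
    print(format(winner, "04b"), stages)
```

In this model the bus values pass through 1111, 1000, 1011, and 1010, so the four competing priorities of the example settle on 1010 after four units of delay.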
Figure 3-1: Asynchronous binary arbitration process with 4 busses. The competing modules are $c_2$,
$c_5$, $c_9$, and $c_{10}$, with corresponding arbitration priorities 0010, 0101, 1001, and 1010. Bits in shaded
regions are not applied to the busses. The arbitration process takes 4 stages.

In this chapter we show that the asynchronous binary arbitration scheme can in fact be
improved. We introduce the new asynchronous binomial arbitration scheme, which uses one
more arbitration bus in addition to the $\lg n$ busses of binary arbitration, but, most surprisingly,
reduces the arbitration time to $\frac{1}{2} \lg n$. In asynchronous binomial arbitration, we use $(\lg n + 1)$-bit
codewords as arbitration priorities and follow the same arbitration protocol as asynchronous
binary arbitration. Our binomial arbitration scheme guarantees fast arbitration by employing
certain codewords that exhibit small data-dependent delays during arbitration processes. For
example, by using the following set of 5-bit codewords {00000, 00001, 00010, 00011, 00100,
00110, 00111, 01000, 01100, 01110, 01111, 10000, 11000, 11100, 11110, 11111} as arbitration
priorities, we can arbitrate among 16 modules using 5 busses in at most 2 stages. Figure 3-2
outlines an asynchronous binomial arbitration process among four such modules $c_1$, $c_6$, $c_{11}$, and
$c_{12}$, with corresponding arbitration priorities 00001, 00111, 10000, and 11000 from the above
set of arbitration priorities, that completes in 2 stages. It turns out that for any subset of the
above 16 codewords, the corresponding arbitration process never takes more than 2 stages. In
Section 3.3, we show how to design a good set of codewords for general values of $n$ by using
binomial codes as arbitration priorities.
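The two-stage claim for this codeword set can be explored with a short simulation. The sketch below rests on several assumptions: a unit-delay model in which all busses start at 0 and every module re-evaluates its drivers once per unit of bus-settling delay, and a generator that produces the 16 codewords as "at most one contiguous run of 1s", which is only a reading of the listed set and not a definition from the text. All function names are illustrative.

```python
from itertools import combinations

def applied_bits(priority: int, bus: int, m: int) -> int:
    """Bits a module drives: its own bits down to the first bus where it has a 0
    but the bus carries a 1; everything below that point is disabled."""
    out = 0
    for j in range(m - 1, -1, -1):
        bit = (priority >> j) & 1
        if bit == 0 and (bus >> j) & 1:
            break
        out |= bit << j
    return out

def arbitrate(priorities, m: int):
    """Return (settled bus value, number of unit delays until it settles)."""
    bus, stages = 0, 0
    while True:
        new_bus = 0
        for p in priorities:
            new_bus |= applied_bits(p, bus, m)
        if new_bus == bus:
            return bus, stages
        bus, stages = new_bus, stages + 1

def single_run_codewords(m: int):
    """All m-bit words whose 1s form one contiguous run, plus the all-zero word."""
    words = {0}
    for hi in range(m):
        for lo in range(hi + 1):
            words.add(((1 << (hi + 1)) - 1) & ~((1 << lo) - 1))  # bits lo..hi set
    return sorted(words)

if __name__ == "__main__":
    words = single_run_codewords(5)
    print([format(w, "05b") for w in words])
    print(arbitrate([0b00001, 0b00111, 0b10000, 0b11000], 5))
    worst = max(arbitrate(c, 5)[1]
                for k in range(1, 5) for c in combinations(words, k))
    print(worst)
```

The final loop only samples subsets of up to four codewords; extending `range(1, 5)` to `range(1, 17)` covers every nonempty subset at the cost of a longer run.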

Figure 3-2: Asynchronous binomial arbitration process with 5 busses. The competing modules are
$c_1$, $c_6$, $c_{11}$, and $c_{12}$, with corresponding arbitration priorities 00001, 00111, 10000, and 11000. Bits in
shaded regions are not applied to the busses. The arbitration process takes 2 stages.

The remainder of this chapter explores priority arbitration schemes that employ busses
to arbitrate among $n$ modules. In Section 3.2 we discuss distributed priority arbitration and
formally define the asynchronous model of priority arbitration with busses. Section 3.3 describes
the two known asynchronous schemes, linear arbitration and binary arbitration, and presents
our new asynchronous binomial arbitration scheme, which with $m = \lg n + 1$ busses arbitrates
in $t = \frac{1}{2} \lg n$ units of time. In Section 3.4 we extend binomial arbitration and present the
generalized binomial arbitration scheme that achieves a spectrum of bus-time tradeoffs of the
form $m = \Theta(t n^{1/t})$, between the number of arbitration busses $m$ and the arbitration time $t$, for
values of $1 \le t \le \lg n$ and $\lg n \le m \le n$. The established bus-time tradeoff is of great practical
interest, enabling system designers to achieve a desirable balance between amount of hardware
and speed. In Section 3.5 we investigate general properties of asynchronous priority arbitration
schemes that employ busses and present some lower bound arguments that demonstrate the
efficiency of our schemes. Several extensions and discussion of the results of this chapter are
presented in Section 3.6, as well as directions for further research.


3.2  Asynchronous  priority  arbitration  with  busses 

In  this  section  we  discuss  priority  arbitration  and  formally  define  the  asynchronous  model  of 
priority  arbitration  with  busses.  The  definitions  in  this  section  model  typical  implementations 
of  asynchronous  priority  arbitration  mechanisms  that  employ  busses. 
Arbitration is the process of selecting one module from a set of contending modules. In
asynchronous  priority  arbitration  with  busses,  each  module  is  assigned  a  unique  arbitration 
priority  —  an  encoding  of  its  name  —  which  is  used  in  determining  logic  values  to  apply  to  the 
busses  during  arbitration.  An  arbitration  protocol  determines  the  logic  values  that  a  competing 
module  applies  to  the  busses,  based  on  the  module’s  arbitration  priority  and  potentially  also  on 
logic  values  on  other  busses.  The  beginning  of  an  arbitration  process  is  generally  indicated  by  a 
system-wide  signal,  usually  called  REQUEST  or  ARBITRATE.  The  resolution  of  an  arbitration 
process  is  the  collection  of  settled  logic  values  on  the  busses  at  the  end  of  the  process,  which 
should  uniquely  identify  the  competing  module  having  the  highest  arbitration  priority. 

Throughout this chapter we use the following notations and assumptions. The set $C =
\{c_0, c_1, \ldots, c_{n-1}\}$ denotes the $n$ system modules (chips), which are assumed to be indexed in
increasing order of priority. The $m$ wired-OR (open-collector) arbitration busses are denoted
by $B = \{b_0, b_1, \ldots, b_{m-1}\}$, where the busses are indexed in increasing order of significance
(to be elaborated later). The set $P = \{p_0, p_1, \ldots, p_{n-1}\}$ consists of $n$ distinct arbitration
priorities (in increasing order of priority), with $p_i$ being the arbitration priority of module $c_i$.
Arbitration priorities are only a convenient mechanism of encoding the modules' names, and in
many asynchronous schemes the arbitration priorities are m-bit vectors that competing modules
apply to the $m$ busses during arbitration. When necessary, we denote the bits of an arbitration
priority $p$ by $p^{(0)}, p^{(1)}, p^{(2)}, \ldots$, in order of increasing significance. We assume that each module
is connected to all busses and can thus read from and potentially write to any bus. All modules
follow the same arbitration protocol in interfacing with the busses and reaching conclusions
concerning the arbitration process. Finally, we assume that only competing modules apply
logic values to the busses; noncompeting modules do not interfere with the busses. All our
assumptions are standard design practice in many systems.

3.2.1  Acyclic  arbitration  protocols 

In asynchronous priority arbitration with busses, we restrict the arbitration process to be purely
combinational by requiring that the arbitration logic on all the modules together with the arbitration
busses form an acyclic circuit. Using combinational logic with asynchronous feedback
paths may introduce race conditions and metastable states, which can defer arbitration indefinitely
(see [2, 62, 72]). The acyclic nature of the arbitration logic imposes a partial order on
the busses, corresponding to partitioning the busses according to their depth in the arbitration
circuitry. This partial order can be extended to a linear order, by having busses at a given
depth succeed busses of greater depth, and by arbitrarily ordering busses of the same depth.
With a linear order on the busses in mind, the acyclic nature of the arbitration circuitry can
be characterized as follows: logic values on higher indexed busses may be used to determine
logic values on lower indexed busses, but not vice versa. We formalize this idea in the following
definition of an acyclic arbitration protocol.

Definition 9 Let P be a set of arbitration priorities. An acyclic arbitration protocol of size m
for P is a sequence $F = (f_{m-1}, \ldots, f_1, f_0)$ of m functions, $f_j : P \times \{0,1\}^{m-1-j} \to \{0,1\}$, for
$j = 0, 1, \ldots, m-1$.

In  asynchronous  priority  arbitration  with  busses,  every  module  has  arbitration  circuitry  that 
implements  the  same  acyclic  arbitration  protocol,  but  with  the  module’s  unique  arbitration 
priority as a parameter. The m arbitration busses are linearly ordered from $b_{m-1}$ down to $b_0$,
in accordance with the acyclic nature of the circuit. Informally, function $f_j$ takes an arbitration
priority $p \in P$ and the $m-1-j$ bit values on the highest $m-1-j$ busses $b_{m-1}$ through $b_{j+1}$,
and determines the bit value that a competing module c with arbitration priority p applies
to bus $b_j$, for $j = 0, 1, \ldots, m-1$. Collectively, an acyclic arbitration protocol F of size m
can be interpreted as a function $F : P \times \{0,1\}^m \to \{0,1\}^m$ that determines the sequence of
m logic values that a competing module c with arbitration priority p applies to the m busses
when detecting a certain configuration of logic values on the m busses. (Notice that not every
function from $\{0,1\}^m$ to $\{0,1\}^m$ constitutes an acyclic arbitration protocol of size m; it has to
satisfy the requirements of Definition 9.)

An arbitration process among several contending modules consists of the modules independently
applying logic values to the m busses, according to an acyclic arbitration protocol F of
size m, until all the busses reach stable logic states. Since acyclic arbitration protocols have
no feedback paths, it is guaranteed that any arbitration process among contending modules
will terminate after a finite number of steps. To formally define and analyze arbitration processes,
however, we first need to discuss some means of measuring the time for asynchronous
arbitration mechanisms with busses.

3.2.2 Bus-settling delay: a unit of time

Measuring  the  arbitration  time  of  asynchronous  mechanisms  is  somewhat  problematic.  We 
follow a standard approach taken in many bus systems (see [16, 22, 23, 40, 42, 80, 81]) and
measure the arbitration time in units of bus-settling delay. The time unit of bus-settling delay,
typically denoted by $T_{bus}$, is the time it takes for a bus to settle to a stable logic value, once its
drivers have stabilized. This time includes the delays introduced by the logic gates driving the
bus, the bus propagation delay, and any additional time required to resolve transient effects.
In effect, we model an open-collector bus as an OR gate with delay $T_{bus}$, the time it takes for
the  output  of  the  gate  to  stabilize  on  a  valid  logic  value,  once  its  inputs  have  reached  their  final 
values.  This  approach  models  the  situation  in  many  bus  systems  rather  accurately. 

High speed busses are commonly modeled as analog transmission lines, where it takes a finite
amount of time for signals to propagate through the bus and bring the bus to a stable logic
value.  Since  busses  carry  analog  signals,  the  logic  value  on  a  bus  cannot  be  used  (and  in  fact 
is  undefined)  before  the  bus  reaches  a  stable  digital  value.  In  addition,  the  response  time  of 
logic  gates  driving  the  busses  and  several  transient  effects  need  to  be  considered.  In  particular, 
the  effect  of  the  wired-OR  glitch  on  bus-settling  time  and  the  use  of  special  integration  logic  at 
module  receivers  to  reduce  this  effect  (see  [5,  18,  42,  81])  indicate  that  the  logic  value  on  a  bus 
may  not  be  used  before  a  unit  of  time,  bus-settling  delay,  passes. 

Some  authors  carry  out  a  more  elaborate  analysis  of  high  speed  busses,  where  they  take 
into  account  distances  between  modules  on  the  bus  and  impose  certain  restrictions  on  the 
ordering  of  modules.  Taub  [79,  80,  81],  for  example,  assumes  a  geographical  ordering  of  modules 
by  increasing  priorities  and  equal  distances  between  modules  on  a  bus.  Counterexamples  to 
Taub’s  analysis,  where  these  requirements  are  not  met,  were  found  [3,  87].  In  Chapter  4,  we 
introduce  and  use  a  digital  transmission  line  model  for  a  bus  that  takes  into  account  distances 
and  signal  propagation.  In  this  chapter,  however,  our  model  for  the  settling  of  a  digital  bus 
makes  no  restricting  assumptions  and  is  applicable  to  wide  classes  of  systems,  where  priorities 
and  module  locations  are  not  fixed  or  predetermined. 

Using our model of a wired-OR (open-collector) bus as a delay element that exhibits a delay
of $T_{bus}$, we can now model an arbitration process as a sequence of applications of an acyclic
arbitration protocol, where each such application completes in one $T_{bus}$ time.

3.2.3 Arbitration processes

We  next  formally  define  the  notion  of  an  arbitration  process  of  an  acyclic  arbitration  protocol 
on  a  set  of  competing  arbitration  priorities.  We  characterize  the  arbitration  process  by  the 
collection of the logic values on the m busses at the end of each computation stage. We use $v_j[l]$
to denote the logic value on bus $b_j$ at the end of the lth computation stage, for $j = 0, 1, \ldots, m-1$
and $l = 0, 1, \ldots$. Without loss of generality, we assume that an arbitration process begins with
all  busses  being  in  logic  value  0. 

Definition 10 Let P be a set of arbitration priorities, F be an acyclic arbitration protocol of
size m for P, and $Q \subseteq P$ be a set of competing arbitration priorities. The arbitration process
of F on Q is the successive evaluation of

$v_j[0] = 0$,

$v_j[l+1] = \bigvee_{p \in Q} f_j(p, v_{m-1}[l], \ldots, v_{j+1}[l])$,

for $j = 0, 1, \ldots, m-1$ and $l = 0, 1, \ldots$. We say that the arbitration process takes t stages if
$t \ge 0$ is the smallest integer for which $v_j[t] = v_j[t+1]$, for $j = 0, 1, \ldots, m-1$. The resolution
of the arbitration process is the stable configuration of values $(v_{m-1}[t], \ldots, v_1[t], v_0[t])$.

Definition  10  characterizes  an  arbitration  process  as  a  sequence  of  successive  applications 
of  the  acyclic  arbitration  protocol  F  to  the  set  of  competing  arbitration  priorities  Q  and  the 
configuration  of  the  m  busses.  The  arbitration  process  terminates  when  no  more  changes  in 
the  state  of  the  busses  occur,  at  which  point  a  resolution  is  reached.  One  can  verify  that  any 
arbitration  process  of  an  acyclic  arbitration  protocol  F  of  size  m  takes  at  most  m  stages.  This 
is  the  case  because  at  each  computation  stage  of  an  arbitration  process  of  an  acyclic  arbitration 
protocol,  at  least  one  more  bus  stabilizes  on  its  final  value. 
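
The recurrence of Definition 10 can be sketched directly in code. The following is a minimal illustration, not part of the original text: the representation of a protocol as a Python function `protocol(p, j, v)` and the two-bus `toy_protocol` are assumptions made only for this example.

```python
from typing import Callable, Iterable, List, Tuple

# Assumed representation: protocol(p, j, v) returns the bit that a module
# with priority p applies to bus b_j when it observes bus values v, where
# v[l] is the value on bus b_l. Acyclicity means it may only read v[l], l > j.
Protocol = Callable[[int, int, List[int]], int]


def arbitration_process(protocol: Protocol,
                        competing: Iterable[int],
                        m: int) -> Tuple[int, List[int]]:
    """Iterate v_j[l+1] = OR over p in Q of f_j(p, v[l]).

    Returns (t, v) where t is the number of stages and v[j] is the
    resolved value on bus b_j.
    """
    q = list(competing)
    v = [0] * m                          # v_j[0] = 0 for every bus
    t = 0
    while True:
        nxt = [0] * m
        for j in range(m):
            for p in q:
                nxt[j] |= protocol(p, j, v)   # wired-OR of all drivers
        if nxt == v:                     # v[t] == v[t+1]: resolution reached
            return t, v
        v = nxt
        t += 1


def toy_protocol(p: int, j: int, v: List[int]) -> int:
    """Hypothetical 2-bus protocol used only for illustration.

    f_1 drives the high bit of p; f_0 drives the low bit of p unless
    bus b_1 shows a 1 while the high bit of p is 0.
    """
    if j == 1:
        return (p >> 1) & 1
    if (p >> 1) & 1 == 0 and v[1] == 1:
        return 0
    return p & 1
```

With competing priorities 1 and 2, `arbitration_process(toy_protocol, {1, 2}, 2)` settles after two stages on the bus values of priority 2, in line with the bound of at most m stages.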

A  better  upper  bound  for  the  number  of  stages  taken  by  arbitration  processes  can  be  given 
by  the  depth  of  the  acyclic  arbitration  protocol.  As  discussed  above,  the  acyclic  nature  of  the 
arbitration  logic  imposes  a  partial  order  on  the  busses.  We  can  therefore  statically  partition 
the  m  busses  into  d  levels,  such  that  the  computation  for  a  bus  in  a  certain  level  uses  only 
the  values  of  busses  in  previous  levels.  More  formally,  given  an  acyclic  arbitration  protocol  F 
of size m, we can simultaneously partition the m functions of F into d nonempty disjoint sets
$F_0, F_1, \ldots, F_{d-1}$, and the m busses of B into d corresponding sets $B_0, B_1, \ldots, B_{d-1}$, such that
$f_j \in F_h$ if and only if $b_j \in B_h$, for $0 \le j \le m-1$ and $0 \le h \le d-1$. The partition must
have the property that the computation of a function $f_j \in F_h$ depends only on the arbitration
priorities and on values of busses in sets $B_0, B_1, \ldots, B_{h-1}$. The depth of an acyclic arbitration
protocol  F  of  size  m  is  defined  as  the  smallest  d ,  for  which  a  partition  as  above  exists.  The 
depth  of  an  acyclic  arbitration  protocol  is  never  greater  than  its  size,  since  placing  each  bus  in 
a  separate  level  satisfies  the  requirements  of  the  above  partition  and  the  number  of  levels  in 
this  partition  is  the  size  of  the  protocol.  The  next  theorem  shows  that  any  acyclic  arbitration 
protocol  of  depth  d  reaches  a  resolution  after  at  most  t  =  d  computation  stages. 
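
As a rough illustration of the depth just defined, the sketch below computes it from a dependency description. The dictionary `reads`, mapping each bus index j to the set of bus indices that $f_j$ actually inspects, is a hypothetical input format chosen for this example; the text itself does not prescribe one.

```python
from typing import Dict, Set


def protocol_depth(reads: Dict[int, Set[int]]) -> int:
    """Smallest number of levels d in a valid level partition.

    reads[j] is the (assumed) set of bus indices whose values f_j inspects.
    The dependencies must be acyclic, as required of the protocol.
    """
    level: Dict[int, int] = {}

    def level_of(j: int) -> int:
        if j not in level:
            # A bus sits one level after the deepest bus it depends on.
            level[j] = 1 + max((level_of(l) for l in reads[j]), default=0)
        return level[j]

    return max((level_of(j) for j in reads), default=0)
```

A protocol in which no function reads any bus has depth 1 regardless of its size, while a protocol in which each $f_j$ reads every higher-indexed bus has depth equal to its size.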

Theorem  22  Let  P  be  a  set  of  arbitration  priorities,  F  be  an  acyclic  arbitration  protocol  of 
size m for P, and d be the depth of F. Then, for any subset $Q \subseteq P$ of competing arbitration
priorities,  the  arbitration  process  of  F  on  Q  takes  at  most  d  stages. 

Proof.  By  induction  on  d ,  the  depth  of  the  acyclic  arbitration  protocol  F. 

Base  case:  d  =  0.  For  depth  d  =  0,  there  are  no  arbitration  busses  and  the  claim  holds 
immediately  for  arbitrary  Q. 

Inductive case: d > 0. Given an acyclic arbitration protocol $F = (f_{m-1}, \ldots, f_1, f_0)$ of size
m and depth d for P, we can partition $F = \bigcup_{h=0}^{d-1} F_h$ and $B = \bigcup_{h=0}^{d-1} B_h$ as discussed above.
Without loss of generality, we assume that the last level consists of the r functions and busses
with indices $0, 1, \ldots, r-1$. The first $d-1$ levels of F constitute an acyclic arbitration protocol
$F' = \bigcup_{h=0}^{d-2} F_h = (f_{m-1}, \ldots, f_{r+1}, f_r)$ of size $m-r$ and depth $d-1$ for P. By induction, the
arbitration process of F' on Q takes at most $d-1$ stages. That is, for any $r \le j \le m-1$ and
$l \ge d-1$, we have $v_j[l] = v_j[d-1]$. In addition, according to the acyclic arbitration protocol
F, we also have that for any $0 \le i \le r-1$ and $k \ge d > 0$,

$v_i[k] = \bigvee_{p \in Q} f_i(p, v_{m-1}[k-1], \ldots, v_r[k-1])$

$= \bigvee_{p \in Q} f_i(p, v_{m-1}[d-1], \ldots, v_r[d-1])$

$= v_i[d]$,

because the dth level depends only on busses $b_{m-1}$ down to $b_r$ and because $k-1 \ge d-1$. This
proves that the arbitration process takes at most d stages. ∎

Theorem 22 shows that the number of stages that an arbitration process takes is bounded
by  the  depth  of  the  acyclic  arbitration  protocol  F.  This  bound  represents  a  standard  static 
approach  in  the  analysis  of  delays  in  digital  circuits,  namely,  that  of  counting  the  number 
of  gates  on  the  longest  path  from  the  inputs  to  the  outputs.  In  later  sections  of  this  chapter, 
however,  we  introduce  and  use  a  novel  dynamic  approach  of  bounding  the  number  of  stages  that 
an  arbitration  process  takes  by  a  careful  analysis  of  the  data-dependent  delays  experienced  in 
the  arbitration  circuits.  In  doing  so,  we  exhibit  arbitration  schemes  that  guarantee  termination 
of  any  arbitration  process  in  a  circuit  of  size  and  depth  m  after  a  fixed  number  of  stages  t,  for 
values  of  t  in  the  range  0  <  t  <  m. 

3.2.4  Asynchronous  priority  arbitration  schemes 

To  complete  the  definition  of  asynchronous  priority  arbitration  schemes,  we  need  to  introduce 
the  notion  of  an  interpretation  function.  Suppose  we  have  a  set  of  arbitration  priorities  P  and 
an  acyclic  arbitration  protocol  F  of  size  m  for  P.  An  interpretation  function  for  P  and  F  is  a 
function $\mathrm{win} : \{0,1\}^m \to P$, such that for any $Q \subseteq P$, with $p \in Q$ being the highest arbitration
priority in Q and $(v_{m-1}, \ldots, v_1, v_0)$ being the resolution of the arbitration process of F on Q,
we have $\mathrm{win}(v_{m-1}, \ldots, v_1, v_0) = p$. Informally, the function win interprets the resolution of
any  arbitration  process  of  F  by  identifying  the  highest  competing  arbitration  priority.  We  are 
now  ready  to  define  an  asynchronous  priority  arbitration  scheme  for  n  modules,  m  busses,  and 
t  stages. 

Definition 11 An asynchronous priority arbitration scheme for n modules, m busses, and t
stages is a triplet A(n, m, t) = (P, F, win), where

1. P is a set of n arbitration priorities;

2. F is an acyclic arbitration protocol of size m for P;

3. win is an interpretation function for P and F;

such that for any $Q \subseteq P$, the arbitration process of F on Q takes at most t stages.

Definition  11  emphasizes  the  role  of  the  arbitration  priorities,  which  are  just  a  mechanism 
to distinguish between different modules. It will become apparent, however, that careful design
of the codewords used as arbitration priorities has a significant impact on the arbitration time.
In the next section, for example, we demonstrate that by using the set of $(\lg n + 1)$-bit binomial
codes as arbitration priorities, we can achieve an arbitration time of $t = \frac{1}{2} \lg n$.


3.3  Asynchronous  priority  arbitration  schemes 

In  this  section  we  first  describe  two  commonly  used  asynchronous  priority  arbitration  schemes: 
linear  arbitration,  which  with  m  =  n  busses  arbitrates  in  time  t  =  1,  and  binary  arbitration, 
which  with  m  =  lg  n  busses  arbitrates  in  time  t  =  lg  n.  We  then  present  our  new  asynchronous 
scheme, binomial arbitration, which with $m = \lg n + 1$ busses arbitrates in time $t = \frac{1}{2} \lg n$.

3.3.1  The  linear  arbitration  scheme 

This  scheme  uses  m  =  n  busses  and  arbitrates  among  n  modules  in  t  =  1  stages.  To  arbitrate, 
contending module $c_i$ applies a 1 to bus $b_i$, for $0 \le i \le n-1$, and does not interfere with other
busses. This translates to module $c_i$ having an n-bit arbitration priority $p_i$ such that $p_i^{(j)} = 1$
if $i = j$ and $p_i^{(j)} = 0$ otherwise. After t = 1 units of time, all the busses stabilize on their final
values,  and  the  module  with  a  1  on  the  bus  with  the  highest  priority  is  recognized  as  the  winner. 
This  scheme  can  also  be  implemented  with  tri-state  busses,  since  at  most  one  module  writes  to 
any  given  bus.  The  scheme  is  also  known  as  decoded  arbitration  and  is  used  in  a  number  of  bus 
systems  and  interrupt  arbitration  mechanisms  (see  [22,  24,  57,  82]). 
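
The description above can be sketched as a few lines of code. This is only an illustration of the one-stage behavior; the function name `linear_arbitrate` and the convention of returning `None` when no module competes are assumptions of this sketch.

```python
from typing import Iterable, Optional


def linear_arbitrate(competing_modules: Iterable[int], n: int) -> Optional[int]:
    """One-stage linear (decoded) arbitration among modules 0..n-1."""
    busses = [0] * n
    for i in competing_modules:
        busses[i] = 1              # module c_i drives only its own bus b_i
    # After one bus-settling delay every bus is stable; the winner is the
    # module whose bus is the highest-indexed one carrying a 1.
    for i in range(n - 1, -1, -1):
        if busses[i]:
            return i
    return None                    # assumed convention: nobody competed
```

For instance, `linear_arbitrate({1, 4, 2}, 6)` returns 4, since bus $b_4$ is the most significant bus carrying a 1.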

Formally, we define this scheme as Linear(n, n, 1) = (P, F, win), where

1. $P = \{p_i = 0^{n-1-i}\, 1\, 0^{i} : i = 0, 1, \ldots, n-1\}$;

2. $F = (f_{n-1}, \ldots, f_1, f_0)$, where $f_j(p, v_{n-1}, \ldots, v_{j+1}) = p^{(j)}$, for $j = 0, 1, \ldots, n-1$;

3. $\mathrm{win}(0^k\, 1\, \alpha) = 0^k\, 1\, 0^{n-1-k} = p_{n-1-k}$, for $0 \le k \le n-1$ and any $\alpha \in \{0,1\}^{n-1-k}$.

Notice  that  although  the  size  of  the  acyclic  arbitration  protocol  of  Linear  is  m  =  n,  its 
depth  is  only  d  =  1,  which  according  to  Theorem  22  implies  that  the  asynchronous  linear 
arbitration scheme takes at most t = 1 stages to arbitrate.

3.3.2 The binary arbitration scheme

This scheme uses $m = \lceil \lg n \rceil$ busses and arbitrates among n modules in $t = \lceil \lg n \rceil$ stages. The
arbitration priority $p_i$ of module $c_i$ is the binary representation of i, for $0 \le i \le n-1$. To
arbitrate, contending module c drives its binary priority p onto the m busses, from $p^{(m-1)}$ (the
most significant bit of p) onto bus $b_{m-1}$, down to $p^{(0)}$ (the least significant bit of p) onto bus
$b_0$; the result being the bitwise OR of the binary priorities of the competing modules. During
arbitration, each competing module c monitors the busses and disables its drivers according to
the following rule: let $p^{(l)}$ be the lth bit of the binary priority p, and let $v_l$ be the binary value
observed on bus $b_l$, for $0 \le l \le m-1$. Then if $p^{(l)} = 0$ and $v_l = 1$, module c disables all its bits
$p^{(j)}$ for $j < l$. Disabled bits are re-enabled should the condition cease to hold. After $t = \lceil \lg n \rceil$
units of time, all the busses stabilize on their final values, and the module whose arbitration
priority appears on the busses is the winner. This scheme was developed by Taub [79], and is
also  known  as  encoded  arbitration  (see  [16,  22,  40,  80,  81]). 

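Formally, we define this scheme Binary(n, $\lceil \lg n \rceil$, $\lceil \lg n \rceil$) = (P, F, win) as follows. For
simplicity of notation we use $m = \lceil \lg n \rceil$.

1. $P = \{p_i = \epsilon_{m-1} \cdots \epsilon_1 \epsilon_0$ : where $\epsilon_{m-1} \cdots \epsilon_1 \epsilon_0$ is the binary representation of i, for
$i = 0, 1, \ldots, n-1\}$;

2. $F = (f_{m-1}, \ldots, f_1, f_0)$, where

$f_j(p, v_{m-1}, \ldots, v_{j+1}) = \begin{cases} 0 & \text{if } \bigvee_{l=j+1}^{m-1} \left( p^{(l)} = 0 \wedge v_l = 1 \right), \\ p^{(j)} & \text{otherwise}, \end{cases}$

for $j = 0, 1, \ldots, m-1$;

3. $\mathrm{win}(\alpha) = \alpha$, for any $\alpha \in \{0,1\}^m$.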

Notice that the size m and the depth d of the acyclic arbitration protocol of Binary are
equal, specifically $m = d = \lceil \lg n \rceil$. This can be verified by noticing that the computation
for each bus $b_j$, where $0 \le j \le m-1$, takes into account values on busses $b_l$, for $j < l \le m-1$.
This implies, according to Theorem 22, that the asynchronous binary arbitration
scheme takes at most $t = m = \lceil \lg n \rceil$ stages to arbitrate. On the other hand, it has been
shown in [22, 23, 80, 81, 88] that there are examples where a binary arbitration process takes
exactly $\lceil \lg n \rceil$ stages. (Figure 3-1 presents such an example for n = 16 modules, $m = \lceil \lg n \rceil = 4$
busses, and t = m = 4 stages.) These examples consist of arbitrating among bad subsets
of  arbitration  priorities,  where  at  each  stage  the  binary  value  of  exactly  one  more  bit  of  the 
highest  competing  binary  priority  is  resolved.  The  asynchronous  binomial  arbitration  scheme, 
presented  next,  guarantees  fast  arbitration  by  employing  only  certain  codewords  that  exhibit 
small  data-dependent  delays. 
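
A small simulation makes the data-dependent behavior of binary arbitration concrete. The sketch below encodes priorities and bus configurations as Python integers (bit l of the integer stands for bus $b_l$); the helper names are assumptions of this example, and the competing set used in the usage note was picked for illustration and is not necessarily the one shown in Figure 3-1.

```python
from typing import Iterable, List


def driven_word(p: int, v: int) -> int:
    """Word that a module with priority p drives when it observes bus word v."""
    conflict = ~p & v                  # busses showing 1 where p has a 0
    if conflict == 0:
        return p                       # no bit is disabled
    l = conflict.bit_length() - 1      # most significant conflicting bus b_l
    return p & ~((1 << l) - 1)         # all bits below b_l are disabled


def binary_trace(competing: Iterable[int]) -> List[int]:
    """Bus words v[0], v[1], ..., v[t], stopping once the busses are stable."""
    q = list(competing)
    trace = [0]                        # all busses start at logic 0
    while True:
        nxt = 0
        for p in q:
            nxt |= driven_word(p, trace[-1])   # wired-OR of all drivers
        if nxt == trace[-1]:
            return trace
        trace.append(nxt)
```

For the competing priorities 1010, 1001 and 0111 the bus words evolve as 0000, 1111, 1000, 1011, 1010, so the process takes four stages on four busses. For 0111, 1011, 1101 and 1110 the trace is 0000, 1111, 1110, which takes only two stages.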

3.3.3  The  binomial  arbitration  scheme 

This scheme uses $m = \lceil \lg n + 1 \rceil$ busses to arbitrate among n modules in $t = \lceil \frac{1}{2} \lg n \rceil$ stages.
This scheme’s acyclic arbitration protocol and interpretation function are identical to those of
the binary arbitration scheme, and thus the same hardware can be used. The only difference
is that binomial codes are used as arbitration priorities rather than all the $2^m$ possible m-bit
codewords of binary arbitration. Alternatively, with m busses, this scheme can arbitrate among
$2^{m-1}$ modules in $t = \lceil \frac{1}{2}(m-1) \rceil$ stages. We next describe the binomial codes and begin by
defining  the  interval-number  of  a  binary  codeword. 

Definition 12 The interval-number of a binary codeword p is the number of intervals of consecutive
1’s or 0’s that it contains, disregarding leading 0’s.

Thus, for example, the interval-number of 001011 is 3, the interval-number of 0000 is 0, and
the interval-number of 10101010 is 8. In general, an m-bit binary codeword p with interval-number
r has the form $p = 0^{m_0} 1^{m_1} 0^{m_2} 1^{m_3} \cdots \delta^{m_r}$, where $\delta \in \{0,1\}$; $m_0 \ge 0$; $m_j > 0$ for
$1 \le j \le r$; and $\sum_{j=0}^{r} m_j = m$. We next define the binomial codes of length m.

Definition 13 The set of binomial codes of length m, denoted by D(m), is the set of all the
m-bit binary codewords that have interval-number at most $\lceil \frac{1}{2}(m-1) \rceil$.

The binomial codes of length m are in fact all the m-bit codewords that, after deleting
leading 0’s, have at most $\lceil \frac{1}{2}(m-1) \rceil$ intervals of consecutive 1’s or 0’s. For example, the
set of binomial codes of length 4 is D(4) = {0000, 0001, 0010, 0011, 0100, 0110, 0111, 1000, 1100,
1110, 1111}, consisting of 11 codewords that have interval-number at most 2. As another
example, the binomial codes that were used in Section 3.1 (the example of Figure 3-2) are
D(5) = {00000, 00001, 00010, 00011, 00100, 00110, 00111, 01000, 01100, 01110, 01111, 10000,
11000, 11100, 11110, 11111}, consisting of the 16 codewords of length 5 with interval-number at
most 2. For general values of m, Corollary 24 in Section 3.4 shows that there are at least $2^{m-1}$
binomial codes of length m. By taking $m = \lceil \lg n + 1 \rceil$, this translates to at least $2^{\lceil \lg n + 1 \rceil - 1} \ge n$
binomial  codes,  which  means  that  there  are  enough  arbitration  priorities  for  n  modules. 
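
The definitions of interval-number and of D(m) translate into a short enumeration. The sketch below is illustrative only; the function names are assumptions of this example, and codewords are represented as Python integers whose m-bit binary form is the codeword.

```python
from typing import List


def interval_number(p: int) -> int:
    """Number of maximal runs of equal bits in p, ignoring leading 0's."""
    bits = bin(p)[2:] if p else ""     # bin() already drops leading 0's
    runs = 0
    prev = None
    for b in bits:
        if b != prev:                  # a new interval starts here
            runs += 1
            prev = b
    return runs


def binomial_codes(m: int) -> List[int]:
    """D(m): all m-bit words with interval-number at most ceil((m - 1) / 2)."""
    limit = m // 2                     # equals ceil((m - 1) / 2) for integer m
    return [p for p in range(2 ** m) if interval_number(p) <= limit]
```

Calling `[format(p, "04b") for p in binomial_codes(4)]` reproduces the 11 codewords of D(4) listed above, and `len(binomial_codes(5))` is 16.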

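Formally, we define this scheme Binomial(n, $\lceil \lg n + 1 \rceil$, $\lceil \frac{1}{2} \lg n \rceil$) = (P, F, win) as follows.
We use $m = \lceil \lg n + 1 \rceil$ and $t = \lceil \frac{1}{2} \lg n \rceil$ for simplicity of notation.

1. P = D(m);

2. $F = (f_{m-1}, \ldots, f_1, f_0)$, where

$f_j(p, v_{m-1}, \ldots, v_{j+1}) = \begin{cases} 0 & \text{if } \bigvee_{l=j+1}^{m-1} \left( p^{(l)} = 0 \wedge v_l = 1 \right), \\ p^{(j)} & \text{otherwise}, \end{cases}$

for $j = 0, 1, \ldots, m-1$;

3. $\mathrm{win}(\alpha) = \alpha$, for any $\alpha \in \{0,1\}^m$.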

It  remains  to  show  that  the  asynchronous  binomial  arbitration  scheme  indeed  arbitrates 
among n modules in at most $t = \lceil \frac{1}{2} \lg n \rceil$ stages. Notice that a standard static analysis of the
arbitration circuitry, as given for example in Theorem 22, does not give the desired result, since
both the size and the depth of the acyclic arbitration protocol F of binomial arbitration are
$m = d = \lceil \lg n + 1 \rceil$. In Section 3.4, we use a novel dynamic approach of analyzing the data-dependent
delays  experienced  in  arbitration  processes,  and  prove  the  correctness  of  the  asynchronous 
binomial  arbitration  scheme  as  a  special  case  of  the  generalized  binomial  arbitration  scheme. 


3.4  Generalized  Binomial  Arbitration 

In this section we extend the ideas of the asynchronous binomial arbitration scheme by presenting
the generalized binomial arbitration scheme that with m busses and in at most t stages
arbitrates among $n = \sum_{l=0}^{t} \binom{m}{l}$ modules. By Stirling’s approximation, the asymptotic bus-time
tradeoff of generalized binomial arbitration is $m \approx \frac{1}{e}\, t\, n^{1/t}$. This bus-time tradeoff is of great
practical interest, enabling system designers to achieve a desirable balance between amount of
hardware and speed. The performance of generalized binomial arbitration is based on analysis
of data-dependent delays.
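
A rough version of the Stirling estimate behind the asymptotic tradeoff can be sketched as follows. It rests on an added assumption that is not spelled out in the text: t is small compared with m, so that the largest term $\binom{m}{t}$ dominates the sum and $\binom{m}{t}$ is close to $m^t / t!$. The final step also drops the factor $(2\pi t)^{1/(2t)}$, which is bounded and tends to 1 as t grows.

```latex
n \;=\; \sum_{l=0}^{t} \binom{m}{l}
  \;\approx\; \binom{m}{t}
  \;\approx\; \frac{m^{t}}{t!}
  \;\approx\; \frac{m^{t}}{\sqrt{2\pi t}\,(t/e)^{t}}
  \;=\; \frac{1}{\sqrt{2\pi t}} \left( \frac{e\,m}{t} \right)^{t},
\qquad\text{so}\qquad
m \;\approx\; \frac{t}{e}\,\bigl(\sqrt{2\pi t}\; n\bigr)^{1/t}
  \;\approx\; \frac{1}{e}\, t\, n^{1/t}.
```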


3.4.1  Generalized  binomial  codes 

We  first  extend  Definition  13  by  defining  the  set  of  generalized  binomial  codes  of  length  m  and 
diversity  r. 

Definition 14 The set of generalized binomial codes of length m and diversity r, denoted by
G(m, r), is the set of all m-bit binary codewords that have interval-number at most r.

Generalized binomial codes serve as arbitration priorities for the generalized binomial arbitration
scheme. The next lemma determines the cardinality of the set of the generalized
binomial codes of length m and diversity r.

Lemma 23 The set G(m, r) contains $\sum_{l=0}^{r} \binom{m}{l}$ distinct codewords.

Proof. To simplify the counting, we take all the codewords in G(m, r) and prepend a 0 to
each of them. This results in a set of (m + 1)-bit words that begin with a 0 and have at most
r switching points from a consecutive interval of 0’s to a consecutive interval of 1’s and vice
versa. The number of such words is $\sum_{l=0}^{r} \binom{m}{l}$, since for any $0 \le l \le r$ there are exactly $\binom{m}{l}$
possibilities of choosing l switching points out of m possible positions. ∎
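
The count in Lemma 23 can be checked by brute force for small parameters. The sketch below is only a sanity check under the integer encoding of codewords assumed here; it uses the observation that the 1 bits of `p ^ (p >> 1)` mark exactly the switching points of the proof (including the switch from the prepended 0 to the first 1).

```python
from math import comb
from typing import List


def interval_number(p: int) -> int:
    """Runs of equal bits in p after dropping leading 0's (0 for p == 0)."""
    # Bit i of p ^ (p >> 1) is 1 exactly when bits i and i+1 of p differ,
    # so the popcount equals the number of switching points.
    return bin(p ^ (p >> 1)).count("1")


def generalized_codes(m: int, r: int) -> List[int]:
    """G(m, r): m-bit words with interval-number at most r."""
    return [p for p in range(2 ** m) if interval_number(p) <= r]


def lemma23_count(m: int, r: int) -> int:
    """Closed form from Lemma 23: sum of C(m, l) for l = 0..r."""
    return sum(comb(m, l) for l in range(r + 1))
```

Comparing `len(generalized_codes(m, r))` with `lemma23_count(m, r)` for all small m and r gives matching values, for example 11 for m = 4 and r = 2.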


Corollary  24  There  are  at  least  2m_1  binomial  codes  of  length  m. 


Proof. By our notation, the set D(m) of binomial codes of length m is defined by
$D(m) = G(m, \lceil \frac{1}{2}(m-1) \rceil)$. According to Lemma 23, we have

$|D(m)| = \sum_{l=0}^{\lceil \frac{1}{2}(m-1) \rceil} \binom{m}{l}$.

This sum includes the first $\lceil \frac{1}{2}(m-1) \rceil + 1$ binomial coefficients, which constitute at least a half
of all the m + 1 binomial coefficients $\binom{m}{l}$. Since the binomial coefficients are symmetric, that is,
$\binom{m}{l} = \binom{m}{m-l}$, the above partial sum is at least a half of the full sum, which is $2^m$. We therefore
conclude that $|D(m)| \ge \frac{1}{2} \cdot 2^m = 2^{m-1}$. ∎

3.4.2 The generalized binomial arbitration scheme

This  scheme  uses  m  busses  and  arbitrates  in  at  most  t  stages,  for  0  <  t  <  m.  With  the  m  and  t 
parameters determined, this scheme can arbitrate among at most $n = \sum_{l=0}^{t} \binom{m}{l}$ modules. The
acyclic  arbitration  protocol  and  the  interpretation  function  of  this  scheme  are  identical  to  those 
of  the  binary  arbitration  scheme  of  Section  3.3.2,  and  thus  the  same  hardware  can  be  used.  The 
only  difference  is  that  generalized  binomial  codes  from  G(m,t)  are  used  as  arbitration  priorities. 
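
To get a feel for the numbers involved, the capacity formula can be evaluated directly. The helpers below are a sketch; the names `capacity` and `min_busses` and the linear search are choices made for this example only.

```python
from math import comb


def capacity(m: int, t: int) -> int:
    """Number of priorities available with m busses and at most t stages."""
    return sum(comb(m, l) for l in range(t + 1))


def min_busses(n: int, t: int) -> int:
    """Smallest m with capacity(m, t) >= n (plain linear search)."""
    if t < 1 and n > 1:
        raise ValueError("with t = 0 stages only one priority is available")
    m = 0
    while capacity(m, t) < n:
        m += 1
    return m
```

For n = 16 modules, `[min_busses(16, t) for t in (1, 2, 3, 4)]` evaluates to `[15, 5, 5, 4]`. The value for t = 1 is one less than the n busses of linear arbitration because this count also admits the all-zero codeword as a priority.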

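Formally, we define this scheme Generalized-Binomial(n, m, t) = (P, F, win), where
$n = \sum_{l=0}^{t} \binom{m}{l}$, as follows.

1. P = G(m, t);

2. $F = (f_{m-1}, \ldots, f_1, f_0)$, where

$f_j(p, v_{m-1}, \ldots, v_{j+1}) = \begin{cases} 0 & \text{if } \bigvee_{l=j+1}^{m-1} \left( p^{(l)} = 0 \wedge v_l = 1 \right), \\ p^{(j)} & \text{otherwise}, \end{cases}$

for $j = 0, 1, \ldots, m-1$;

3. $\mathrm{win}(\alpha) = \alpha$, for $\alpha \in \{0,1\}^m$.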

The  idea  behind  generalized  binomial  arbitration  is  that  the  interval-number  of  the  highest 
competing  arbitration  priority  bounds  the  number  of  arbitration  stages.  In  binary  arbitration, 
where  all  the  2m  possible  m-bit  codewords  are  used,  there  are  arbitration  processes  that  can  take 
as many as m stages, where at each stage one more bit of the highest competing arbitration priority
is resolved. For generalized binomial arbitration, however, we select codewords that have
at most t intervals of consecutive 1’s or 0’s. The following theorem uses data-dependent analysis
to  argue  that  any  arbitration  process  takes  at  most  r  stages,  where  r  is  the  interval-number 
of  the  highest  competing  arbitration  priority,  by  showing  that  at  each  stage  the  arbitration 
process  resolves  at  least  one  more  interval  of  consecutive  bits,  rather  than  a  single  bit. 


Theorem  25  Consider  a  generalized  binomial  arbitration  process  on  m  busses.  Let  Q  be  the 
set  of  competing  arbitration  priorities,  p  be  the  highest  arbitration  priority  in  Q,  and  r  be  the 
interval-number of p. Then after s stages, for any $s \ge r$, bus $b_j$ carries the logic value $p^{(j)}$, for
$0 \le j \le m-1$.

Proof. We prove the theorem by induction on r for arbitrary values of m. We use the notation
$v_j[k]$ to denote the logic value on bus $b_j$ at the end of stage k, for $j = 0, 1, \ldots, m-1$ and
$k = 0, 1, \ldots$.

Base case: r = 0. The codeword p consists of m consecutive 0’s, that is, $p^{(j)} = 0$ for
$j = 0, 1, \ldots, m-1$. Since p is the highest arbitration priority in Q, then any $q \in Q$ must also
have $q^{(j)} = 0$ for $j = 0, 1, \ldots, m-1$. By our assumption that all the m busses are initially in
logic value 0, and since according to the acyclic arbitration protocol no module ever applies a
1 to any of these busses, the m busses remain in logic value 0 forever. In other words, after s
stages, for any $s \ge r = 0$, we have $v_j[s] = v_j[0] = 0 = p^{(j)}$, for $j = 0, 1, \ldots, m-1$, which proves
the claim.

Inductive case: r > 0. The codeword p has m bits and interval-number r, and is thus of
the form $p = 0^{m_0} 1^{m_1} 0^{m_2} 1^{m_3} \cdots \delta^{m_r}$, where $\delta \in \{0,1\}$; $m_0 \ge 0$; $m_j > 0$ for $1 \le j \le r$; and
$\sum_{j=0}^{r} m_j = m$. We first concentrate on the first $r-1$ intervals of p, and define the set R of
reduced codewords of length $\bar{m} = m - m_r = \sum_{j=0}^{r-1} m_j$ by ignoring the last $m_r$ bits of the
codewords of Q. One can verify that $\bar{p}$, the reduced version of p, is the highest codeword in
R, because we discarded the $m_r$ least significant bits of codewords in Q. Furthermore, the
interval-number of $\bar{p}$ is $r-1$, since the last interval of p of the form $\delta^{m_r}$ was ignored. By
applying the claim inductively with $\bar{m}$ busses, the set of competing arbitration priorities R,
and the highest arbitration priority $\bar{p}$ of interval-number $r-1$, we find that after $r-1$ stages
the most significant $\bar{m} = m - m_r$ busses stabilize to the bits of $\bar{p}$. That is, for any $k \ge r-1$, we
have $v_j[k] = v_j[r-1] = \bar{p}^{(j - m_r)} = p^{(j)}$, for $m_r \le j \le m-1$. We now consider the last $m_r$ busses,
$b_{m_r - 1}, \ldots, b_1, b_0$. There are two cases to consider:

$\delta = 1$: The rth interval of p is an interval of $m_r$ consecutive 1’s, that is, $p^{(i)} = 1$ for
$i = 0, 1, \ldots, m_r - 1$. After k stages, for any $k \ge r-1$, the most significant $m - m_r$ busses
carry the bits of p, and therefore there is no l in the range $0 \le l \le m-1$, with $v_l[k] = 1$
and $p^{(l)} = 0$. As a result, the module with arbitration priority p applies all its last $m_r$
consecutive 1’s. Therefore, for any $s \ge r$ and $i = 0, 1, \ldots, m_r - 1$, we have $v_i[s] = v_i[r] = 1 = p^{(i)}$,
since the busses implement a wired-OR in one stage.

$\delta = 0$: The rth interval of p is an interval of $m_r$ consecutive 0’s, that is, $p^{(i)} = 0$ for
$i = 0, 1, \ldots, m_r - 1$. Since p is the highest arbitration priority in Q, then for any arbitration
priority $q \in Q$, $q \ne p$, there must exist an l in the range $m_r \le l \le m-1$, with $p^{(l)} = 1$
and $q^{(l)} = 0$. After k stages, for any $k \ge r-1$, the most significant $m - m_r$ busses
carry the bits of p, and therefore any module with arbitration priority $q \ne p$ disables
at least its last $m_r$ bits. As a result, for any $s \ge r$ and $i = 0, 1, \ldots, m_r - 1$, we have
$v_i[s] = v_i[r] = 0 = p^{(i)}$, because the busses implement a wired-OR in one stage and no
module applies a 1 to busses $b_0$ through $b_{m_r - 1}$ anymore.

Thus, after s stages, for $s \ge r$, the m busses carry the corresponding bits of p. ∎

The  following  corollary  shows  that  by  taking  G(m,t),  the  generalized  binomial  codes  of 
length  m  and  diversity  t,  as  arbitration  priorities,  we  guarantee  that  any  arbitration  process 
completes  in  at  most  t  stages. 

Corollary 26 Consider GENERALIZED-BINOMIAL$(n,m,t)$, the generalized binomial arbitration
scheme. For any subset of arbitration priorities $Q \subseteq G(m,t)$, the corresponding arbitration
process takes at most $t$ stages.

Proof. Let $p$ be the highest arbitration priority in $Q$. Since the interval-number of $p$ is at most
$t$, Theorem 25 guarantees that the arbitration process on $Q$, with $p$ as the highest arbitration
priority, takes no more than $t$ stages. ∎

3.4.3  Tradeoff  of  generalized  binomial  arbitration 

The generalized binomial arbitration scheme achieves a bus-time tradeoff of the form $n =
\sum_{j=0}^{t} \binom{m}{j}$, which by Stirling’s formula exhibits asymptotic behavior $m \approx \frac{1}{e}\, t\, n^{1/t}$. Figure 3-3
demonstrates this bus-time tradeoff for a system with $n$ modules. The horizontal axis represents
$m$, the number of arbitration busses used, which varies from $m = \lg n$ to $m = n$. The arbitration
time $t$, measured in units of bus-settling delay (arbitration stages), is marked on the vertical
axis. The arbitration time varies between $t = 1$ and $t = \lg n$ stages. Generalized binomial
arbitration reduces to binary arbitration with $m = \lg n$ busses, to binomial arbitration with
$m = \lg n + 1$ busses, and to a modified version of linear arbitration (see Section 3.5.2 for the
canonical form of linear arbitration) with $m = n$ busses.

Figure 3-3 demonstrates that neither linear arbitration nor binary arbitration utilizes the
resources efficiently. For example, increasing the number of busses used in binary arbitration by
one results in speeding up the arbitration process by a factor of 2, as exhibited by our binomial
arbitration scheme. On the other hand, allowing another time unit over linear arbitration
enables reducing the number of busses from $n$ to approximately $\sqrt{2n}$.

Figure 3-3. Bus-time tradeoff of the generalized binomial arbitration scheme for $n$ modules, using
$\lg n \le m \le n$ busses and $1 \le t \le \lg n$ stages.

Notice, however, that in order to achieve another factor-of-2 improvement in the arbitration
time, adding another constant number of busses to the $\lg n$ busses is not enough. Asymptotically,
as $n$ grows without bound, we need to use more than $(1 + \epsilon)\lg n$ busses, for $\epsilon > 0.232$,
in order for the sum $\sum_{j=0}^{t} \binom{m}{j}$, with $t = \frac{1}{4}\lg n$, to be at least $n$. This can be verified by
Stirling’s formula, since when $m$ is greater than $\lg n$ but smaller than $1.232 \lg n$, and when
$t = \frac{1}{4}\lg n < m/4$, the sum of the first $m/4$ binomial coefficients $\binom{m}{j}$, for $0 \le j \le m/4$, does
not exceed $n$. This demonstrates that our binomial arbitration scheme, which uses $\lg n + 1$
busses, exhibits a most economical balance, much more so than the binary arbitration scheme.
Other authors [23] have also discovered that by excluding certain codewords, the arbitration
time of binary arbitration can be reduced. Here, however, we give the first general scheme that
provides a full spectrum of bus-time tradeoff.
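
The endpoints of the tradeoff discussed in this section can be checked numerically. The following Python sketch is an illustration only: the helper names `capacity`, `min_busses`, and `binary_entropy` are hypothetical and do not come from the thesis. It evaluates $n = \sum_{j=0}^{t} \binom{m}{j}$ directly and uses the standard entropy estimate of a partial binomial sum to recover the constant 1.232.

```python
from math import comb, log2


def capacity(m: int, t: int) -> int:
    """Number of priorities available with m busses and t stages: sum_{j=0}^{t} C(m, j)."""
    return sum(comb(m, j) for j in range(t + 1))


def min_busses(n: int, t: int) -> int:
    """Smallest m with capacity(m, t) >= n, found by linear search."""
    m = 0
    while capacity(m, t) < n:
        m += 1
    return m


def binary_entropy(x: float) -> float:
    """Binary entropy H(x) in bits, for 0 < x < 1."""
    return -x * log2(x) - (1 - x) * log2(1 - x)


if __name__ == "__main__":
    n = 1024
    for t in (10, 5, 2):
        print(f"t = {t:2d}: m = {min_busses(n, t)}")
    # The sum of the first m/4 binomial coefficients is at most 2^(H(1/4) m),
    # so it cannot reach n unless m is at least lg(n) / H(1/4).
    print(f"1 / H(1/4) = {1 / binary_entropy(0.25):.4f}")
```

For $n = 1024$ this gives 10 busses at $t = 10$ (binary arbitration), 11 busses at $t = 5$ (binomial arbitration), and 45 busses at $t = 2$, which is close to $\sqrt{2n} \approx 45.3$.
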
3.5 Properties of asynchronous priority arbitration schemes

In  this  section  we  discuss  properties  and  capabilities  of  general  asynchronous  priority  arbitration 
schemes  with  busses,  which  were  defined  in  Section  3.2.4.  We  first  describe  several  properties 
and  assumptions  regarding  asynchronous  priority  arbitration  schemes  with  busses.  We  then 
define  a  canonical  form  for  acyclic  arbitration  protocols  that  is  easier  to  analyze  and  reason 
about than arbitrary acyclic arbitration protocols. Finally, we focus on the bus-time tradeoff
of general asynchronous priority arbitration schemes and present some lower bound arguments
that demonstrate the efficiency of our schemes.

3.5.1  General  properties  and  assumptions 

Asynchronous priority arbitration schemes that employ busses arbitrate among contending
modules by having the modules read logic values from the busses and apply logic values to the
busses, according to an underlying acyclic arbitration protocol. For an asynchronous priority
arbitration scheme $A = (P, F, \mathit{win})$ that employs $m$ busses, the acyclic arbitration protocol $F$ is
a sequence of $m$ functions, each responsible for applying a binary value to a separate bus, based
on the competing module’s arbitration priority and on logic values on higher indexed busses.
The acyclic nature of the arbitration protocol $F$ guarantees termination of any arbitration
process in at most $t = m$ stages, as was formally discussed in Section 3.2.3. We are also
interested, however, in asynchronous priority arbitration schemes that arbitrate in $t$ stages, for
any value of $t$ in the range $0 \le t \le m$.

The configurations of the $m$ arbitration busses play a fundamental role in the analysis of
arbitration processes. A configuration of the $m$ busses at any given time is simply the $m$-bit
vector of logic values on the busses. We denote a general configuration on the $m$ busses by
$v = (v_{m-1}, \ldots, v_1, v_0)$, and for arbitration processes we use $v[k] = (v_{m-1}[k], \ldots, v_1[k], v_0[k])$, for
$k \ge 0$, to denote the configuration of the $m$ busses at stage $k$. We assume that any arbitration
process starts from a “clean” configuration of all 0’s, that is, $v_j[0] = 0$ for $j = 0, 1, \ldots, m-1$.
An acyclic arbitration protocol $F$ of size $m$ can be thought of as a function that maps an
arbitration priority $p$ and a configuration $v$ to an $m$-bit vector $u$ that a contending module $c$
with arbitration priority $p$ applies to the $m$ busses, when detecting the configuration $v$. When
convenient, we use the vector notation $F(p,v) = u$ to describe this situation.
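
The vector view $F(p,v) = u$ lends itself to a direct simulation of an arbitration process. The Python sketch below is a minimal illustration under stated assumptions: configurations are tuples whose component $j$ is the value on bus $b_j$, each stage replaces the configuration by the wired-OR of the responses of all competing modules, and `binary_protocol` is an assumed rendering of the binary arbitration rule (a module withholds bit $j$ whenever a higher-indexed bus carries a 1 where its own priority has a 0). The function names are hypothetical and are not taken from the thesis.

```python
from typing import Callable, List, Sequence, Tuple

Config = Tuple[int, ...]   # config[j] is the logic value on bus b_j


def to_bits(x: int, m: int) -> Config:
    """m-bit vector of x; component j is bit j of x (bus b_j)."""
    return tuple((x >> j) & 1 for j in range(m))


def binary_protocol(p: Config, v: Config) -> Config:
    """Assumed binary-arbitration rule F(p, v) = u (an illustration, not a quotation)."""
    m = len(p)
    u = []
    for j in range(m):
        # Bit j is withheld if a higher-indexed bus carries a 1 where p has a 0.
        beaten = any(v[l] == 1 and p[l] == 0 for l in range(j + 1, m))
        u.append(0 if beaten else p[j])
    return tuple(u)


def run_arbitration(F: Callable[[Config, Config], Config],
                    Q: Sequence[Config], m: int) -> List[Config]:
    """Return [v[0], v[1], ...] for a nonempty Q, stopping once a configuration repeats."""
    history = [tuple([0] * m)]            # "clean" start: all busses at 0
    for _ in range(m + 1):
        v = history[-1]
        # Wired-OR: bus j carries a 1 if any competing module applies a 1 to it.
        nxt = tuple(max(F(q, v)[j] for q in Q) for j in range(m))
        history.append(nxt)
        if nxt == v:
            break
    return history


if __name__ == "__main__":
    competitors = [to_bits(x, 3) for x in (5, 6, 3)]
    for k, v in enumerate(run_arbitration(binary_protocol, competitors, 3)):
        print(k, v)
```

Because each function $f_j$ here looks only at busses of index higher than $j$, the protocol is acyclic in the sense used above, and the loop needs at most $m$ stages before the configuration repeats.
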
For an asynchronous priority arbitration scheme $A(n,m,t) = (P, F, \mathit{win})$ on $n$ modules,
$m$ busses, and $t$ stages, any arbitration process on a subset $Q \subseteq P$ takes at most $t$ computation
stages. Certain arbitration processes may take fewer than $t$ stages, but it
is guaranteed that after $t$ stages the busses are always stable. Since $A = (P, F, \mathit{win})$ implements
priority arbitration and since there are $n$ modules in the system, there must be at least $n$ distinct
winning configurations, each being mapped by the interpretation function win to a unique
arbitration priority $p_i$, which identifies module $c_i$ as the winner of an arbitration process. Some
modules may have more than one winning configuration, as is the case for example with the
linear arbitration scheme of Section 3.3.1, but each module must have at least one. Because the
number of intermediate and winning configurations in arbitration processes is hard to track,
it is difficult to analyze the behavior of arbitration processes. In Section 3.5.2, we show how
to translate arbitration protocols into a canonical form, which has the same arbitration power,
but is easier to analyze.

3.5.2  Canonical  form  for  arbitration  protocols 

In an arbitration process of an asynchronous priority arbitration scheme with busses, the competing
module $c$ with the highest arbitration priority $p$ should direct the arbitration process
to a winning configuration $v$ that identifies it, that is, $\mathit{win}(v) = p$. This should be the case
no matter which of the modules with arbitration priorities smaller than $p$ participate in the
arbitration process. For competing module $c_i$, with arbitration priority $p_i$, therefore, there may
be as many as $2^i$ different arbitration processes that module $c_i$ should win, corresponding to
all possible subsets of the modules $\{c_0, c_1, \ldots, c_{i-1}\}$ participating in the arbitration process.
To simplify the analysis of arbitration processes, we introduce a canonical form of arbitration
protocols, which has the same arbitration power, but is easier to analyze.

Definition 15 Let $P = \{p_0, p_1, \ldots, p_{n-1}\}$ be a set of $n$ distinct arbitration priorities and let
$F = (f_{m-1}, \ldots, f_0)$ be an acyclic arbitration protocol of size $m$ for $P$. We say that $F$ is in
canonical form, if for any configuration $v = (v_{m-1}, \ldots, v_1, v_0)$, for any $j = 0, 1, \ldots, m-1$, for
any $i = 0, 1, \ldots, n-1$, and for any $0 \le k \le i$, we have

$$f_j(p_i, v_{m-1}, \ldots, v_{j+1}) = 0 \implies f_j(p_k, v_{m-1}, \ldots, v_{j+1}) = 0 .$$
Definition 15, in effect, defines a canonical acyclic arbitration protocol as one that maps
any arbitration priority $p$ and configuration $v$ to an $m$-bit vector $u$ that “shadows” any activity
of arbitration priorities of lesser priority than $p$. The definition guarantees that if a module
applies a 0 to a certain bus in response to some configuration $v$ of the busses, then no module
with lesser priority applies a 1 to that bus in response to the same configuration $v$. In other
words, for any arbitration priorities $p$ and $q$, with $p$ being of higher priority than $q$, and for any
configuration $v$, the $m$-bit vector $F(p, v)$ is never component-wise smaller than the $m$-bit vector
$F(q, v)$. In analyzing arbitration processes of canonical acyclic arbitration protocols, therefore,
it is sufficient to focus only on the behavior of the highest competing arbitration priority $p$,
since the protocol for $p$ always “shadows” the behavior of smaller arbitration priorities. We
call an asynchronous priority arbitration scheme canonical if its acyclic arbitration protocol
is canonical. We typically denote that an arbitration scheme or an arbitration protocol is
canonical by putting a bar over it, as in $\bar{A}$ or $\bar{F}$. Analyzing canonical asynchronous priority
arbitration schemes is an easier task. The next theorem demonstrates that analyzing canonical
asynchronous priority arbitration schemes is also general enough.

Theorem 27 Let $A(n,m,t) = (P, F, \mathit{win})$ be an asynchronous priority arbitration scheme
on $n$ modules, $m$ busses, and $t$ stages. Then there is also a canonical asynchronous priority
arbitration scheme $\bar{A}(n,m,t) = (P, \bar{F}, \mathit{win})$ on $n$ modules, $m$ busses, and $t$ stages.

Proof. To define the canonical asynchronous priority arbitration scheme $\bar{A} = (P, \bar{F}, \mathit{win})$, we
need only define the canonical acyclic arbitration protocol $\bar{F}$; the arbitration priorities $P$ and
the interpretation function win are identical to those of $A$. We define $\bar{F} = (\bar{f}_{m-1}, \ldots, \bar{f}_1, \bar{f}_0)$
as follows. For any configuration $v = (v_{m-1}, \ldots, v_1, v_0)$, for any $j = 0, 1, \ldots, m-1$, and for any
$i = 0, 1, \ldots, n-1$, we define

$$\bar{f}_j(p_i, v_{m-1}, \ldots, v_{j+1}) = \bigvee_{l=0}^{i} f_j(p_l, v_{m-1}, \ldots, v_{j+1}) .$$

In fact, we define the $m$-bit vector that module $c_i$ with arbitration priority $p_i$ applies to the $m$
busses under protocol $\bar{F}$ in response to a configuration $v$, to be the bitwise OR of the $m$-bit
vectors that modules $c_0, c_1, \ldots, c_i$ with corresponding arbitration priorities $p_0, p_1, \ldots, p_i$ apply
to the $m$ busses under protocol $F$ in response to the same configuration $v$.
To show that $\bar{A} = (P, \bar{F}, \mathit{win})$ is a canonical asynchronous priority arbitration scheme on $n$
modules, $m$ busses, and $t$ stages, we first notice that $P$ is a set of $n$ distinct arbitration priorities,
as required. The arbitration protocol $\bar{F} = (\bar{f}_{m-1}, \ldots, \bar{f}_1, \bar{f}_0)$ is acyclic, since by definition, each
function $\bar{f}_j$, for $j = 0, 1, \ldots, m-1$, takes an arbitration priority $p \in P$ and $m - 1 - j$ bit values
$(v_{m-1}, \ldots, v_{j+1})$ and produces one bit, as required. Furthermore, $\bar{F} = (\bar{f}_{m-1}, \ldots, \bar{f}_1, \bar{f}_0)$ is in
canonical form, since for any configuration $v = (v_{m-1}, \ldots, v_1, v_0)$, for any $j = 0, 1, \ldots, m-1$,
for any $i = 0, 1, \ldots, n-1$, and for any $0 \le k \le i$, we have

$$\bar{f}_j(p_i, v_{m-1}, \ldots, v_{j+1}) = 0
\implies \bigvee_{l=0}^{i} f_j(p_l, v_{m-1}, \ldots, v_{j+1}) = 0
\implies \bigvee_{l=0}^{k} f_j(p_l, v_{m-1}, \ldots, v_{j+1}) = 0
\implies \bar{f}_j(p_k, v_{m-1}, \ldots, v_{j+1}) = 0 ,$$

as required by Definition 15. We then have that $\bar{F}$ is a canonical acyclic arbitration protocol
of size $m$ for $P$.

We now argue that for any $Q \subseteq P$, the arbitration process of $\bar{F}$ on $Q$ takes at most $t$
stages. Let $p_i \in Q$ be the highest arbitration priority in $Q$. Because $\bar{F}$ is in canonical form,
the arbitration process of $\bar{F}$ on $\{p_i\}$ is indistinguishable from the arbitration process of $\bar{F}$ on
$Q$. (Under $\bar{F}$, arbitration priority $p_i$ always “shadows” the activity of $Q$.) By our definition of
$\bar{F}$, the arbitration process of $\bar{F}$ on $\{p_i\}$ is an exact simulation of the arbitration process of $F$
on $\{p_0, p_1, \ldots, p_i\}$, which by definition of $A$ takes at most $t$ stages. We then conclude that the
arbitration process of $\bar{F}$ on $\{p_i\}$ takes at most $t$ stages, which also means that the arbitration
process of $\bar{F}$ on $Q$ takes at most $t$ stages.

Last, we verify that the function win is indeed an interpretation function for $P$ and $\bar{F}$. Let
$Q \subseteq P$ be a set of competing arbitration priorities and let $p_i \in Q$ be the highest arbitration
priority in $Q$. Let $v$ be the resolution of the arbitration process of $\bar{F}$ on $Q$. As argued above,
$v$ is also the resolution of the arbitration process of $\bar{F}$ on $\{p_i\}$, which is the resolution of the
arbitration process of $F$ on $\{p_0, p_1, \ldots, p_i\}$. Since $p_i$ is also the highest arbitration priority in
$\{p_0, p_1, \ldots, p_i\}$, and since win is an interpretation function for $P$ and $F$, we have $\mathit{win}(v) = p_i$,
which implies that win is also an interpretation function for $P$ and $\bar{F}$.

This completes the proof that $\bar{A} = (P, \bar{F}, \mathit{win})$ is a canonical asynchronous priority arbitration
scheme on $n$ modules, $m$ busses, and $t$ stages. ∎

Theorem 27 shows that canonical acyclic arbitration protocols have the same arbitration
power as other acyclic arbitration protocols. The proof transforms an acyclic arbitration protocol
$F$ into a canonical acyclic arbitration protocol $\bar{F}$, by having module $c_i$ with arbitration
priority $p_i$ be paranoid and always assume that all the modules $c_0, c_1, \ldots, c_{i-1}$ with arbitration
priorities $p_0, p_1, \ldots, p_{i-1}$ also participate in its arbitration processes. Under protocol $\bar{F}$,
then, module $c_i$ responds to any configuration by simulating the combined responses of modules
$c_0, c_1, \ldots, c_i$ to the same configuration under protocol $F$.
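
The transformation used in the proof of Theorem 27 can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the thesis's own formulation: protocols are modelled as functions from a priority and a configuration to an $m$-bit tuple, `one_hot_protocol` is a hypothetical non-canonical protocol in which every module simply applies its own one-hot priority, and `is_canonical` checks the condition of Definition 15 by brute force over all configurations.

```python
from itertools import product
from typing import Callable, Sequence, Tuple

Config = Tuple[int, ...]                       # component j is bus b_j
Protocol = Callable[[Config, Config], Config]  # F(p, v) -> u


def one_hot_protocol(p: Config, v: Config) -> Config:
    """Hypothetical non-canonical protocol: a module always applies its own priority."""
    return p


def canonicalize(F: Protocol, priorities: Sequence[Config]) -> Protocol:
    """Build F_bar from F; `priorities` must be listed as p_0, p_1, ..., p_{n-1}."""
    index = {p: i for i, p in enumerate(priorities)}

    def F_bar(p: Config, v: Config) -> Config:
        i = index[p]
        responses = [F(priorities[l], v) for l in range(i + 1)]
        # Bitwise OR of the responses of p_0, ..., p_i to the same configuration.
        return tuple(max(bits) for bits in zip(*responses))

    return F_bar


def is_canonical(F: Protocol, priorities: Sequence[Config], m: int) -> bool:
    """Brute-force check of the canonical-form condition over all 2^m configurations."""
    for v in product((0, 1), repeat=m):
        for i, p_i in enumerate(priorities):
            u_i = F(p_i, v)
            for p_k in priorities[: i + 1]:
                u_k = F(p_k, v)
                # Violation: p_i applies 0 to a bus where a lesser priority applies 1.
                if any(u_i[j] == 0 and u_k[j] == 1 for j in range(m)):
                    return False
    return True


n = 4
P = [tuple(1 if j == i else 0 for j in range(n)) for i in range(n)]  # p_i has its 1 on bus b_i
F_bar = canonicalize(one_hot_protocol, P)

if __name__ == "__main__":
    print(is_canonical(one_hot_protocol, P, n))   # False
    print(is_canonical(F_bar, P, n))              # True
    print(F_bar(P[2], (0, 0, 0, 0)))              # (1, 1, 1, 0)
```

The last line shows the "paranoid" behavior: the module with priority $p_2$ drives busses $b_2$, $b_1$, and $b_0$, as if all lesser modules were also competing.
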

For example, transforming the asynchronous linear arbitration scheme of Section 3.3.1 to
canonical form results in a scheme where, to arbitrate, contending module $c_i$ applies a 1 to
busses $b_i, \ldots, b_0$, and does not interfere with other busses. After $t = 1$ units of time, all the
busses stabilize on their final values, and the module with a 1 on the highest indexed bus is
recognized as the winner. Formally, this scheme is derived from LINEAR$(n, n, 1) = (P, F, \mathit{win})$,
and is defined as CANONICAL-LINEAR$(n, n, 1) = (P, \bar{F}, \mathit{win})$, where

1. $P = \{p_i = 0^{n-1-i}\, 1\, 0^{i} : \text{for } i = 0, 1, \ldots, n-1\}$;

2. $\bar{F} = (\bar{f}_{n-1}, \ldots, \bar{f}_1, \bar{f}_0)$, where for $j = 0, 1, \ldots, n-1$ and $i = 0, 1, \ldots, n-1$, we have
$\bar{f}_j(p_i, v_{n-1}, \ldots, v_{j+1}) = 1$ if $j \le i$ and $\bar{f}_j(p_i, v_{n-1}, \ldots, v_{j+1}) = 0$ if $j > i$;

3. $\mathit{win}(0^{k}\, 1\, \alpha) = 0^{k}\, 1\, 0^{n-1-k} = p_{n-1-k}$, for $0 \le k \le n-1$ and any $\alpha \in \{0,1\}^{n-1-k}$.
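
The three components listed above can be exercised with a short Python sketch. It is an illustration only, assuming that a configuration is stored as a tuple whose component $j$ is the value on bus $b_j$; the function names `priority`, `f_bar`, `arbitrate`, and `win` are hypothetical. Since the responses do not depend on the bus state, a single wired-OR stage settles the busses.

```python
from typing import Iterable, Tuple

Config = Tuple[int, ...]   # config[j] is the value on bus b_j


def priority(i: int, n: int) -> Config:
    """p_i = 0^(n-1-i) 1 0^i, stored so that component j corresponds to bus b_j."""
    return tuple(1 if j == i else 0 for j in range(n))


def f_bar(i: int, n: int) -> Config:
    """Response of module c_i: a 1 on busses b_i, ..., b_0, regardless of the bus state."""
    return tuple(1 if j <= i else 0 for j in range(n))


def arbitrate(competitors: Iterable[int], n: int) -> Config:
    """One stage of wired-OR over the responses of a nonempty set of competing modules."""
    responses = [f_bar(i, n) for i in competitors]
    return tuple(max(bits) for bits in zip(*responses))


def win(v: Config) -> Config:
    """Map a configuration to the priority named by its highest-indexed 1."""
    highest = max(j for j, bit in enumerate(v) if bit == 1)
    return priority(highest, len(v))


if __name__ == "__main__":
    n = 5
    v = arbitrate([0, 3, 1], n)
    print(v, win(v) == priority(3, n))
```

Enumerating every nonempty subset of competitors for a small $n$ confirms that only $n$ distinct configurations are ever reached, one per possible winner.
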

We use the canonical forms of arbitration protocols for analysis purposes only. In practice,
there may be several drawbacks to using canonical forms of acyclic arbitration protocols, due
to their overly paranoid behavior. The advantage of canonical forms arises in investigating the
computational power of asynchronous priority arbitration schemes with busses. When analyzing
an asynchronous priority arbitration scheme for $n$ modules, there may be a need to investigate
all possible $2^n$ arbitration processes, corresponding to the $2^n$ possible subsets of competing
modules. For a canonical asynchronous priority arbitration scheme on $n$ modules, however,
there are exactly $n$ different arbitration processes to analyze, and there are exactly $n$ reachable
winning configurations. This is the case since for canonical protocols, higher arbitration
priorities always “shadow” the activity of smaller arbitration priorities.
3.5.3 The bus-time tradeoff

Analytically, the simplest way to define the optimal bus-time tradeoff of asynchronous priority
arbitration schemes is to fix $m$, the number of arbitration busses used, to fix $t$, the number
of arbitration stages allowed, and to investigate the largest number of modules that can be
arbitrated by some asynchronous priority arbitration scheme with $m$ busses in at most $t$ stages.
Formally, we define $\mathcal{R}(m,t)$, for $m \ge 0$ and $t \ge 0$, as the smallest integer such that any
$A(n, m, t) = (P, F, \mathit{win})$, an asynchronous priority arbitration scheme for $n$ modules, $m$ busses,
and $t$ stages, satisfies $n \le \mathcal{R}(m,t)$. Theorem 27 implies that in investigating $\mathcal{R}(m,t)$, it suffices
to focus only on canonical asynchronous priority arbitration schemes with $m$ busses and $t$ stages.
We take advantage of this fact when convenient. The following lemma shows that the value of
$\mathcal{R}(m,t)$ is well defined for any $m \ge 0$ and $t \ge 0$.

Lemma 28 For any $m \ge 0$ and $t \ge 0$, we have $\mathcal{R}(m,t) \le 2^m$.

Proof. Let $A(n,m,t) = (P, F, \mathit{win})$ be a canonical asynchronous priority arbitration scheme
on $n$ modules, $m$ busses, and $t$ stages. With $m$ busses there are no more than $2^m$ possible
configurations of binary values on the busses, but there must be exactly $n$ distinct resolutions
of arbitration processes of $A$. We must then have $n \le 2^m$. Since this bound holds for arbitrary
canonical asynchronous priority arbitration schemes, we also have $\mathcal{R}(m,t) \le 2^m$. ∎

Lemma 28 states that no more than $2^m$ modules can be arbitrated with $m$ busses. Given
enough time, we can arbitrate among exactly $n = 2^m$ modules, as the following lemma implies.

Lemma 29 For any $m \ge 0$ we have $\mathcal{R}(m, m) = 2^m$.

Proof. The asynchronous binary arbitration scheme of Section 3.3.2 arbitrates among $n$ modules,
using $m = \lg n$ busses and $t = m = \lg n$ stages. Said another way, with $m$ busses and in
$t = m$ stages, exactly $n = 2^m$ modules can be arbitrated. Combining this with the result of
Lemma 28, we have $\mathcal{R}(m, m) = 2^m$. ∎

From Lemmas 28 and 29 it follows that there is no advantage in using more units of time
than the number of busses. We summarize this observation in the following theorem.

Theorem 30 For any $t \ge m \ge 0$ we have $\mathcal{R}(m,t) = 2^m$.


The next theorem shows that $\mathcal{R}(m,t)$ is monotonically nondecreasing in both $m$ and $t$.

Theorem 31 For any $m \ge 0$ and $t \ge 0$ we have

1. $\mathcal{R}(m+1, t) \ge \mathcal{R}(m, t)$,

2. $\mathcal{R}(m, t+1) \ge \mathcal{R}(m, t)$.

Proof. Increasing the number of arbitration busses or the number of arbitration stages cannot
decrease the number of modules that can arbitrate. We show this by describing how to simulate
any asynchronous priority arbitration scheme $A(n,m,t) = (P, F, \mathit{win})$ by a scheme with more
busses or time.

1. Define $A'(n, m+1, t) = (P', F', \mathit{win}')$ as follows. The arbitration priorities $P' = P$
are unchanged. If $F = (f_{m-1}, \ldots, f_1, f_0)$, then define $F' = (f'_m, \ldots, f'_1, f'_0)$, where
$f'_j = f_{j-1}$ for $j = 1, 2, \ldots, m$, and $f'_0(p, v) = 0$ for any $p \in P'$ and $v \in \{0,1\}^m$. Finally,
we define $\mathit{win}'(v_m, v_{m-1}, \ldots, v_1, v_0) = \mathit{win}(v_m, v_{m-1}, \ldots, v_1)$ for $v_j \in \{0,1\}$ and $j =
0, 1, \ldots, m$. Informally, the asynchronous priority arbitration scheme $A'$ simulates $A$ on
the first $m$ busses and ignores the last bus. Since this simulation method works for
arbitrary asynchronous priority arbitration schemes, we then have $\mathcal{R}(m+1, t) \ge \mathcal{R}(m, t)$.

2. Since $A(n,m,t) = (P, F, \mathit{win})$ arbitrates among any $Q \subseteq P$ in at most $t$ stages, it also
arbitrates in at most $t + 1$ stages, which shows that $\mathcal{R}(m, t+1) \ge \mathcal{R}(m, t)$. ∎

We now turn to investigate $\mathcal{R}(m, t)$, for values of $m > t \ge 0$. The next lemma investigates
the case $t = 0$.

Lemma 32 For any $m \ge 0$ we have $\mathcal{R}(m, 0) = 1$.

Proof. With $t = 0$ stages and $m$ busses to arbitrate, for any value of $m \ge 0$, the reading of
the busses after $t = 0$ stages consists of $m$ zeros. It then follows that we can arbitrate among
at most one module, that is, $\mathcal{R}(m, 0) = 1$ for any $m \ge 0$. ∎

We next investigate $\mathcal{R}(m,t)$ for the case $t = 1$. The following theorem demonstrates that
any canonical asynchronous priority arbitration scheme with $m$ busses can be in at most $m + 1$
different configurations after $t = 1$ stages.
Theorem 33 Let $A(n,m,t) = (P, F, \mathit{win})$ be a canonical asynchronous priority arbitration
scheme on $n$ modules, $m$ busses, and $t$ stages. Let $U = \{u : u = F(p, 0^m) \text{ for } p \in P\}$ be the
set of all possible responses of modules of $A$ to the initial configuration $v = 0^m$. Then we have
$|U| \le m + 1$.

Proof. For convenience of analysis, we refine the definition of $U$. Corresponding
to $P = \{p_0, p_1, \ldots, p_{n-1}\}$, the set of responses $U$ is a set of $m$-bit vectors $U =
\{u_i : u_i = F(p_i, 0^m) \text{ for } i = 0, 1, \ldots, n-1\}$. Each $m$-bit vector $u_i \in U$ is the response of $p_i$
under $F$ to the configuration $v = 0^m$. Since $F$ is a canonical acyclic arbitration protocol and
since the arbitration priorities are indexed in increasing order of priority, we must have that for
any $0 \le k < i \le n-1$, the $m$-bit vector $u_i$ has a 1 at component $j$ if the $m$-bit vector $u_k$ has a
1 at component $j$, for $j = 0, 1, \ldots, m-1$. The vectors in $U$ are therefore totally ordered
component-wise, so distinct vectors in $U$ contain distinct numbers of 1's, and an $m$-bit vector
contains between 0 and $m$ 1's. This implies, by the pigeonhole principle, that there
cannot be more than $m + 1$ such $m$-bit vectors in $U$, or that $|U| \le m + 1$. ∎
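
The counting step in this proof can be confirmed by brute force for small $m$. The Python sketch below is an illustration only (the names `dominates` and `longest_chain` are hypothetical): it computes the largest family of distinct $m$-bit vectors in which every pair is comparable component-wise, which is the structure that the canonical form imposes on $U$.

```python
from itertools import product
from typing import Dict, Tuple

Vector = Tuple[int, ...]


def dominates(a: Vector, b: Vector) -> bool:
    """True if a has a 1 at every component where b has a 1."""
    return all(x >= y for x, y in zip(a, b))


def longest_chain(m: int) -> int:
    """Size of the largest set of distinct m-bit vectors totally ordered component-wise."""
    # Process vectors by increasing number of 1's, so that every vector strictly
    # dominated by v has already been assigned its longest chain length.
    vectors = sorted(product((0, 1), repeat=m), key=sum)
    best: Dict[Vector, int] = {}
    for v in vectors:
        best[v] = 1 + max((best[u] for u in best if dominates(v, u)), default=0)
    return max(best.values())


if __name__ == "__main__":
    for m in range(5):
        print(m, longest_chain(m))
```

For every $m$ tried, the longest chain has exactly $m + 1$ vectors, matching the bound $|U| \le m + 1$.
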

Armed with Theorem 33, we can now show that $\mathcal{R}(m, 1) = m + 1$.

Lemma 34 For any $m \ge 0$ we have $\mathcal{R}(m, 1) = m + 1$.

Proof. From Theorem 33 it follows that any canonical asynchronous priority arbitration
scheme $A(n, m, 1) = (P, F, \mathit{win})$ on $n$ modules, $m$ busses, and $t = 1$ stages, can reach at most
$m + 1$ distinct resolutions. For any such canonical asynchronous priority arbitration scheme
$A$, there must be exactly $n$ resolutions, which implies that $n \le m + 1$. Since this bound holds
for arbitrary $A$, we then also have $\mathcal{R}(m, 1) \le m + 1$.

With $t = 1$, our generalized binomial arbitration scheme of Section 3.4.2 achieves $n =
\sum_{j=0}^{1} \binom{m}{j} = \binom{m}{0} + \binom{m}{1} = 1 + m$. We therefore conclude that $\mathcal{R}(m, 1) = m + 1$. ∎

We next generalize Theorem 33, by showing that any canonical asynchronous priority arbitration
scheme with $m$ busses and $t$ stages can be in at most $\binom{m+1}{s} s!$ different configurations
after $0 \le s \le t$ stages.

Theorem 35 Let $A(n, m, t) = (P, F, \mathit{win})$ be a canonical asynchronous priority arbitration
scheme on $n$ modules, $m$ busses, and $t$ stages. Let $U[0] = \{0^m\}$ be the set of the initial
configuration of $m$ bits of 0, and let $U[s]$, for $1 \le s \le t$, be the set of possible configurations of
$A$ after $s$ stages. Then, for any $0 \le s \le t$, we have $|U[s]| \le \binom{m+1}{s} s!$.
Proof. We prove the theorem by induction on $s$. For convenience of analysis, we first refine
the definition of $U[s]$, for $0 \le s \le t$.

Due to the canonical nature of the asynchronous priority arbitration scheme $A$, there are
exactly $n$ distinct arbitration processes to analyze, each corresponding to a different module
$c_i$ being the highest priority module arbitrating. We begin by defining the sequence of configurations
that module $c_i$ with arbitration priority $p_i$ generates if $c_i$ is the highest priority
module that arbitrates. For any $0 \le i \le n-1$, we define $u_i[0] = 0^m$ and we inductively define
$u_i[s] = F(p_i, u_i[s-1])$, for values of $s \ge 1$. The canonical nature of the acyclic arbitration
protocol $F$ guarantees that the $m$-bit vector $u_i[s]$ is the configuration of the $m$ busses after
$s$ stages, when module $c_i$ is the highest priority module arbitrating, no matter which of the
modules $c_0, c_1, \ldots, c_{i-1}$ also arbitrates. The set $U[s]$ of all possible configurations of $A$ after $s$
stages, for any $s \ge 0$, can now be defined as

$$U[s] = \bigcup_{i=0}^{n-1} \{u_i[s]\} .$$

This is the case because if module $c_i$ is the highest priority module arbitrating, then the configuration
of the $m$ busses after $s$ stages is $u_i[s]$.

We now prove the theorem by induction on $s$. For the case $s = 0$, we have $U[0] = \{0^m\}$ and
$|U[0]| = 1 = \binom{m+1}{0} 0!$. For $s = 1$, we have from Theorem 33 that $|U[1]| \le m + 1 = \binom{m+1}{1} 1!$. We
now assume that for $s - 1$ we have $|U[s-1]| \le \binom{m+1}{s-1}(s-1)!$, and show that $|U[s]| \le \binom{m+1}{s} s!$.

The set of possible configurations of the $m$ busses after $s - 1$ stages is $U[s-1]$. Each
configuration $u \in U[s-1]$ defines an equivalence class, $C_u = \{c_i : u_i[s-1] = u\}$, of all the
modules $c_i$ that bring the busses to configuration $u$ after $s - 1$ stages. (Correspondingly, we
define $P_u = \{p_i : u_i[s-1] = u\}$, for each $u \in U[s-1]$.) This definition implies that for any
$u \in U[s-1]$, the configuration of the $m$ busses after $s - 1$ stages is $u$ if and only if some module
$c_i \in C_u$ is the highest priority module arbitrating. Furthermore, for each $u \in U[s-1]$ (or for
each $c_i \in C_u$), the first $s - 1$ busses $b_{m-1}, b_{m-2}, \ldots, b_{m-s+1}$ have stabilized on the first $s - 1$ bits
of $u$. The modules in $C_u$ have only the last $m - s + 1$ busses $b_{m-s}, \ldots, b_0$ to which they
can apply new values at stage $s$. Focusing on the last $m - s + 1$ busses, an argument similar to
that of the proof of Theorem 33 shows that there are at most $m - s + 2$ different responses of
modules in $C_u$ during stage $s$. Said formally, for any $u \in U[s-1]$ we have

$$\Bigl| \bigcup_{p \in P_u} \{F(p, u)\} \Bigr| \le m - s + 2 .$$

That is, any configuration $u \in U[s-1]$ can develop to no more than $m - s + 2$ configurations
during stage $s$. By definition, we have

$$U[s] = \bigcup_{u \in U[s-1]} \; \bigcup_{p \in P_u} \{F(p, u)\} ,$$

which implies

$$|U[s]| \le |U[s-1]| \cdot (m - s + 2)
\le \binom{m+1}{s-1}(s-1)! \cdot (m - s + 2)
= \frac{(m+1)!}{(m-s+2)!} \cdot (m - s + 2)
= \frac{(m+1)!}{(m-s+1)!}
= \binom{m+1}{s} s! ,$$

which completes the proof of the theorem. ∎

Theorem 35 demonstrates that any canonical asynchronous priority arbitration scheme with
$m$ busses and $t$ stages can be in no more than $\binom{m+1}{s} s!$ different configurations after $s$ stages,
for any $0 \le s \le t$. The result of Theorem 35 implies the following theorem.

Theorem 36 For any $m \ge t \ge 0$, we have $\mathcal{R}(m,t) \le \binom{m+1}{t} t!$.

Proof. Let $A(n,m,t) = (P, F, \mathit{win})$ be a canonical asynchronous priority arbitration scheme
on $n$ modules, $m$ busses, and $t$ stages. From Theorem 35 we have that the number of possible
configurations that $A$ can be in after $t$ stages is at most $\binom{m+1}{t} t!$. We then have $n \le \binom{m+1}{t} t!$,
because $A$ has exactly $n$ resolutions. Since this discussion holds for arbitrary $A$, we conclude
that $\mathcal{R}(m,t) \le \binom{m+1}{t} t!$. ∎

The preceding analysis provides several nontrivial bounds for the bus-time tradeoff of general
asynchronous priority arbitration schemes. These bounds were obtained by analyzing the
canonical forms of such schemes. We conjecture, however, that the bounds of Theorem 35 and
of Theorem 36 are not tight in general, and that the tight bound for the bus-time tradeoff is
$\mathcal{R}(m,t) = \sum_{j=0}^{t} \binom{m}{j}$, exhibited by our generalized binomial arbitration scheme.
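
The gap between the proven upper bound and the conjectured tight value can be tabulated for small parameters. The following Python sketch is an illustration only, with hypothetical function names; it compares $\binom{m+1}{t} t!$ from Theorem 36 against $\sum_{j=0}^{t} \binom{m}{j}$, the number of modules achieved by generalized binomial arbitration.

```python
from math import comb, factorial


def upper_bound(m: int, t: int) -> int:
    """Upper bound of Theorem 36: C(m+1, t) * t! = (m+1)! / (m+1-t)!."""
    return comb(m + 1, t) * factorial(t)


def binomial_capacity(m: int, t: int) -> int:
    """Modules arbitrated by generalized binomial arbitration: sum_{j=0}^{t} C(m, j)."""
    return sum(comb(m, j) for j in range(t + 1))


if __name__ == "__main__":
    print(" m  t  achieved  bound")
    for m in range(1, 7):
        for t in range(0, m + 1):
            print(f"{m:2d} {t:2d} {binomial_capacity(m, t):9d} {upper_bound(m, t):6d}")
```

The two quantities coincide for $t = 0$ and $t = 1$ and separate from $t = 2$ onward; for example, with $m = 5$ and $t = 2$ the scheme achieves 16 modules while the bound allows 30. For larger $t$ the bound of Lemma 28, $2^m$, can be the smaller of the two upper bounds.
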
3.6 Discussion and extensions

This  section  contains  some  discussion,  additional  results,  and  directions  for  further  research  on 
priority  arbitration  with  busses. 

3.6.1 The k-ary arbitration scheme

The linear arbitration and binary arbitration schemes of Section 3.3 use n-ary and binary
representations, respectively, of module priorities. We can also use radix-$k$ representation of
module priorities, for other values of $k$, to arbitrate among $n = k^t$ modules in $t$ units of
time, using $m = tk$ busses. We sketch the asynchronous k-ary arbitration scheme here due
to its simplicity and because it generalizes the linear and binary arbitration schemes rather
straightforwardly. This scheme exhibits a bus-time tradeoff of the form $m = t n^{1/t}$, which is a
factor of $e$ worse than the asymptotic bus-time tradeoff exhibited by our generalized binomial
arbitration scheme of Section 3.4.2.

Asynchronous k-ary arbitration, for $2 \le k \le n$, can be described as follows. Each module is
assigned a unique k-ary arbitration priority consisting of t radix-k digits. We divide the m = tk
busses into t disjoint groups, each consisting of k busses. During arbitration, competing module
c applies the t radix-k digits of its arbitration priority p to the t groups of busses, using linear
encoding of its digits on each group of k busses. As arbitration progresses, competing module
c monitors the t groups of busses and disables its drivers according to the following rule: let
$p^{(l)}$ be the l-th radix-k digit of p and $d_l$ be the highest index of a bus in the l-th group of busses
that carries a 1. Then if $p^{(l)} < d_l$, module c disables all its digits $p^{(j)}$ for j < l. Disabled
digits are re-enabled should the condition cease to hold. Arbitration proceeds in t stages, each
of which consists of resolving the value of another radix-k digit of the highest competing k-ary
arbitration priority.
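The disabling rule above can be exercised with a small simulation. The following is a minimal sketch, not the formulation used in this thesis. It assumes that "linear encoding" means a module whose digit has value v drives the bus of index v in the corresponding group, so that $d_l$ is the largest enabled digit in group l, and it replaces true asynchronous settling by synchronous rounds. The names `radix_digits` and `kary_arbitrate` are hypothetical.

```python
def radix_digits(p, k, t):
    """Return the t radix-k digits of p, least significant digit first."""
    return [(p // k**l) % k for l in range(t)]


def kary_arbitrate(priorities, k, t):
    """Simulate k-ary arbitration in rounds (an assumed simplification).

    Returns (winning_priority, rounds), where rounds counts the updates of
    the enable flags that changed at least one flag.
    """
    digits = [radix_digits(p, k, t) for p in priorities]
    n = len(priorities)
    # enabled[c][j] is True while module c drives digit j onto group j.
    enabled = [[True] * t for _ in range(n)]
    rounds = 0
    while True:
        # d[l]: highest bus index carrying a 1 in group l (-1 if the group is idle).
        d = [max((digits[c][l] for c in range(n) if enabled[c][l]), default=-1)
             for l in range(t)]
        # Rule: if digit l of module c is below d[l], disable all digits j < l.
        updated = [[not any(digits[c][l] < d[l] for l in range(j + 1, t))
                    for j in range(t)]
                   for c in range(n)]
        if updated == enabled:
            # Fixed point reached: the busses carry the digits of the winner.
            winner = sum(d[l] * k**l for l in range(t))
            return winner, rounds
        enabled = updated
        rounds += 1
```

For instance, `kary_arbitrate([5, 7, 2], k=3, t=2)` resolves to priority 7. The most significant group is never disabled, so its value is fixed first, and each further round fixes at least one more group.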

The asynchronous k-ary arbitration scheme combines the ideas of the asynchronous binary
protocol with linear encoding of arbitration priorities, to achieve an intermediate bus-time
tradeoff, $m = t n^{1/t}$. The acyclic arbitration protocol of k-ary arbitration is of size m = tk, but
its depth is only d = t. The analysis of k-ary arbitration is a static one, similar to the analysis
of binary arbitration. Implementing the asynchronous k-ary arbitration scheme, however, may
require different circuitry for arbitration in radix k. Our generalized binomial arbitration
scheme, besides achieving a better bus-time tradeoff, is also immediately implementable on
any arbitration circuitry of binary arbitration, which is the most commonly used asynchronous
priority arbitration scheme with busses.

3.6.2  Bus-time  tradeoff  of  asynchronous  priority  arbitration 

In Section 3.5.3, we proved that any asynchronous priority arbitration scheme on n modules,
m busses, and t stages satisfies $n \le \binom{m+1}{t} t!$. Our generalized binomial arbitration scheme of
Section 3.4.2 achieves a better bus-time tradeoff of the form $n = \sum_{i=0}^{t} \binom{m}{i}$. There is still a
gap between the upper and the lower bounds on the bus-time tradeoff of asynchronous priority
arbitration schemes. We conjecture that the bus-time tradeoff exhibited by the generalized
binomial arbitration scheme is optimal for our model of asynchronous priority arbitration with
busses, but we were unable to prove or disprove it. Using the notation of Section 3.5.3, we
conjecture that $\mathcal{R}(m,t) = \sum_{i=0}^{t} \binom{m}{i}$, for any m > 0 and t > 0.
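To get a feel for the size of this gap, the module counts discussed in this chapter can be tabulated for small m and t. This is only an illustrative sketch. It reads the three counts as the sum of binomial coefficients for generalized binomial arbitration, the product $\binom{m+1}{t} t!$ for the bound of Section 3.5.3, and $k^t$ with k = m/t for the k-ary scheme of Section 3.6.1. The function names are hypothetical.

```python
from math import comb, factorial


def binomial_scheme_modules(m, t):
    """n = sum_{i=0}^{t} C(m, i): count for generalized binomial arbitration."""
    return sum(comb(m, i) for i in range(t + 1))


def upper_bound_modules(m, t):
    """n <= C(m+1, t) * t!: the upper bound quoted from Section 3.5.3."""
    return comb(m + 1, t) * factorial(t)


def kary_modules(m, t):
    """n = k^t with k = m // t busses per group (asynchronous k-ary arbitration)."""
    return (m // t) ** t


# Columns: m, t, k-ary count, binomial count, upper bound.
for m, t in [(8, 2), (12, 3), (16, 4)]:
    print(m, t, kary_modules(m, t), binomial_scheme_modules(m, t),
          upper_bound_modules(m, t))
```

For m = 8 and t = 2, the k-ary scheme handles 16 modules, the generalized binomial scheme 37, and the upper bound allows 72, so the conjecture concerns the factor of roughly two that remains in this instance.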

3.6.3  Synchronous  priority  arbitration  schemes 

In this chapter we discussed the asynchronous model of priority arbitration with busses and
presented several asynchronous schemes. Considering synchronous priority arbitration schemes
that use clocked arbitration logic, one can show that a synchronous version of k-ary arbitration
achieves a bus-time tradeoff of the form $m = n^{1/t}$. (Variants of this scheme are used in
synchronous communication protocols; see [45, 71].) In synchronous priority arbitration, busses
can be reused on successive clock cycles, which enables a better bus-time tradeoff than that of
asynchronous priority arbitration, in that there is no multiplicative factor of t in the bus-time
tradeoff $m = n^{1/t}$.
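The missing factor of t can be made concrete by setting the two tradeoffs side by side. This is a sketch under two assumptions: that k-ary arbitration resolves one radix-k digit per stage on a group of k busses, as in Section 3.6.1, and that the synchronous version reuses a single such group on every clock cycle.

```latex
\begin{align*}
\text{asynchronous:}\quad & n = k^t,\quad m = tk \;\Longrightarrow\; m = t\,n^{1/t},\\
\text{synchronous:}\quad  & n = k^t,\quad m = k  \;\Longrightarrow\; m = n^{1/t}.
\end{align*}
```

For example, n = 64 modules and t = 3 stages give k = 4, so under these assumptions the asynchronous scheme uses 12 busses while the synchronous one uses 4.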

For synchronous priority arbitration schemes, a related arbitration model can be defined.
In this model it is possible to prove that the tradeoff $m = \Theta(n^{1/t})$ is optimal. The proof utilizes
the result of Lemma 34 that with m busses at most n = m + 1 modules can be arbitrated in
t = 1 stage. Using synchronous priority arbitration in t stages, one cannot do any better than
arbitrating among at most $n = (m + 1)^t$ modules, which implies the optimality of the tradeoff
$m = \Theta(n^{1/t})$ exhibited by the synchronous version of k-ary arbitration.

3.6.4 Resource tradeoffs

Resource tradeoffs of the form $m = \Theta(t n^{1/t})$, based on multiway trees and the special class of
binomial trees, are discussed in [8] for a variety of problems such as parallel sorting algorithms,
searching algorithms, and VLSI layouts. Asynchronous priority arbitration with busses can in
fact be considered as a selection process on trees. Asynchronous k-ary arbitration corresponds
to a selection process on regular trees of branching factor k, while asynchronous generalized
binomial arbitration corresponds to a selection process on the more economical “modified
binomial trees” of [8].

3.6.5  Directions  for  further  research 

In  this  chapter  we  investigated  a  model  for  the  settling  of  a  digital  bus  that  assumes  a  unit 
of  time  (bus-settling  delay)  for  the  bus  to  stabilize  to  a  valid  logic  value.  There  are  several 
situations, such as electrical transmission lines, radio channels, and optical fibers, however,
where  a  different  analysis  based  on  distances  and  directions  may  be  required.  In  Chapter  4  we 
examine  the  performance  of  priority  arbitration  schemes  in  a  more  elaborate  model  of  a  bus  as 
a  digital  transmission  line. 

The  busses  in  the  arbitration  mechanisms  investigated  serve  as  a  shared  memory  into  which 
modules  write  and  from  which  they  read.  These  busses/memory  implement  the  OR  function 
of  the  values  written  to  them.  There  might  be  some  interest  in  other  logic  functions  that 
busses/memory  can  implement.  One  interesting  case  would  be  memory  cells  that  can  compute 
the  majority  function  on  0/1  values  written  into  them. 

Our work has concentrated on analyzing the data-dependent behavior of arbitration mechanisms
that use fixed module priorities. There are several mechanisms that do not use deterministic
module priorities or that arbitrate by using randomized protocols. It would be interesting
to extend our analysis to these more flexible or randomized schemes.

Finally,  the  domain  of  data-dependent  analysis  has  not  been  heavily  investigated.  There  are 
many  interesting  circuits  that  exhibit  faster  performance  than  implied  by  the  static  measure 
of  their  depth.  A  more  systematic  approach  for  data-dependent  analysis  would  prove  to  be 
a valuable tool for circuit designers. There has been some focus on the structure of delay-insensitive
codes [85], for example, but not on data-dependent performance of logic circuits.

Chapter 4


Priority  Arbitration  on  Digital 
Transmission  Busses 


This  chapter  examines  the  performance  of  priority  arbitration  schemes  presented  in  Chapter 
3 under the digital transmission line bus model. This bus model accounts for the propagation
time of signals along bus lines and assumes that the propagating signals are always valid
digital  signals.  A  widely  held  misconception  is  that  in  the  digital  transmission  line  model  the 
arbitration  time  of  the  binary  arbitration  scheme  is  at  most  4  units  of  bus-propagation  delay. 
We  formally  disprove  this  conjecture  by  demonstrating  that  the  arbitration  time  of  the  binary 
arbitration  scheme  is  heavily  dependent  on  the  arrangement  of  the  arbitrating  modules  in  the 
system.  We  provide  a  general  scenario  of  module  arrangement  on  m  busses,  for  which  binary 
arbitration  takes  at  least  m/2  units  of  bus-propagation  delay  to  stabilize.  We  also  prove  that 
for  general  arrangements  of  modules  on  m  busses,  binary  arbitration  settles  in  at  most  m/2  +  2 
units of bus-propagation delay, while binomial arbitration settles in at most m/4 + 2 units of
bus-propagation delay, thereby demonstrating the superiority of binomial arbitration for general
arrangements of modules under the digital transmission line model. For linear arrangements of
modules in increasing order of priorities and equal spacings between modules, we show that 3
units of bus-propagation delay are necessary for binary arbitration to settle, and we sketch an
argument that 3 units of bus-propagation delay are also asymptotically sufficient.

4.1 Introduction

The  nature  of  signal  propagation  through  a  communication  medium  has  a  significant  impact 
on  the  design  of  communication  protocols  for  that  medium.  In  any  communication  system, 
the  time  required  for  a  signal  sent  by  a  given  module  to  reach  another  module  depends  on  the 
propagation  speed  of  signals  in  the  communication  medium,  the  distance  between  the  modules, 
and  the  directionality  of  signal  propagation.  Although  different  communication  media  may 
have  different  signal-propagation  speeds,  qualitatively  they  can  be  modeled  in  similar  ways. 
Communication  protocols  must  account  for  signal  propagation  delays  by  allowing  enough  time 
for  information  to  disseminate  through  the  system. 

In  this  chapter  we  investigate  the  effects  of  signal  propagation  delays  through  bus  lines 
on  the  performance  of  priority  arbitration  schemes  presented  in  Chapter  3.  For  high-speed 
signals,  a  bus  acts  like  an  analog  transmission  line  with  associated  impedance  that  affects  the 
propagation  delays  (see  [5,  22,  40,  88]).  A  complete  characterization  of  signal  propagation  on 
analog  transmission  lines  involves  several  transient  effects  such  as  reflections,  superposition, 
and  attenuation  of  signals.  Analyzing  the  performance  of  communication  protocols  in  such 
detailed  analog  models  is  a  rather  difficult  task,  however,  and  to  make  such  analyses  tractable 
a  digital  transmission  line  model  for  a  bus  is  commonly  used.  This  model  accounts  for  the 
propagation  delays  of  signals  along  a  bus,  assumes  that  the  propagating  signals  are  always 
valid  digital  signals,  and  ignores  reflections,  superposition,  and  attenuation  of  signals.  The 
digital  transmission  line  model  is  a  model  of  an  idealized  digital  bus,  which  ignores  the  delays 
caused  by  the  analog  nature  of  signals  on  electrical  busses  and  focuses  on  the  delays  that  arise 
from  signal  propagation  along  bus  lines. 

Several researchers have studied the performance of the asynchronous binary arbitration scheme
of  Section  3.3.2  in  the  digital  transmission  line  bus  model.  Taub  [79,  81]  investigated  the 
maximal  propagation  delay  of  signals  in  the  binary  arbitration  scheme,  under  the  assumptions 
that  modules  are  linearly  arranged  in  increasing  order  of  priorities  and  that  they  are  equally 
spaced  on  the  bus  lines.  Taub  showed  that  in  such  situations  4  units  of  bus-propagation  delay 
are  sufficient  for  the  binary  arbitration  scheme  to  settle,  no  matter  how  many  bus  lines  are 
involved. However, Taub claimed that such an arrangement of system modules exhibits a worst-case
scenario and concluded that 4 units of bus-propagation delay are always sufficient for the
binary arbitration scheme on any number of bus lines. Empirical counterexamples to Taub’s
claim were found [3, 87], which consist of arranging system modules in certain arrangements
that require more than 4 units of bus-propagation delay for binary arbitration to settle. In [3],
for instance, Ashcroft, Rivest, and Ward provide a specific example of arranging n = 4 modules
on m = 7 bus lines, such that 5 units of bus-propagation delay are required for the binary
arbitration  scheme  to  stabilize.  Other  such  empirical  examples  were  found  that  contradict 
Taub’s  hypothesis  for  general  cases.  In  this  chapter,  we  identify  the  flaw  in  Taub’s  hypothesis, 
provide  tight  upper  and  lower  bounds  on  the  time  (in  units  of  bus-propagation  delay)  required 
by  binary  arbitration  for  general  arrangements  of  modules,  and  reexamine  linear  arrangements 
of  modules  in  increasing  order  of  priorities. 

In  the  remainder  of  this  chapter,  we  investigate  the  binary  arbitration  scheme  in  the  digital 
transmission  line  bus  model.  Section  4.2  discusses  some  issues  of  signal  propagation  on  electrical 
transmission  lines  and  describes  the  digital  transmission  line  model  of  a  bus.  In  Section  4.3,  we 
formally  disprove  Taub’s  conjecture  by  providing  a  general  scenario  of  module  arrangement  on 
m  busses,  for  which  binary  arbitration  takes  at  least  m/2  units  of  bus-propagation  delay.  We 
also  prove  that  for  arbitrary  arrangements  of  modules  on  m  busses,  binary  arbitration  settles 
in at most m/2 + 2 units of bus-propagation delay, while binomial arbitration from Chapter 3
settles  in  at  most  m/4  +  2  units  of  bus-propagation  delay,  thereby  demonstrating  the  superiority 
of  binomial  arbitration  for  general  module  arrangements  in  the  digital  transmission  line  model. 
Section  4.4  examines  linear  arrangements  of  modules  in  increasing  order  of  priorities  and  equal 
spacings  between  modules  on  the  bus  lines.  In  such  arrangements,  we  show  that  3  units  of  bus- 
propagation  delay  are  necessary  for  binary  arbitration  to  settle,  and  we  sketch  an  argument 
that  3  units  of  bus-propagation  delay  are  also  asymptotically  sufficient.  Finally,  in  Section  4.5, 
we  discuss  the  results  of  this  chapter  and  indicate  directions  for  further  research. 

4.2  Busses  as  transmission  lines 

In  this  section  we  discuss  the  transmission  line  nature  of  electrical  busses.  We  first  describe 
some  of  the  analog  issues  of  electrical  bussed  transmission  lines,  which  affect  the  design  of  many 
bus  systems  and  protocols.  We  then  present  the  digital  transmission  line  model  of  a  bus,  which 
serves as a low-level digital abstraction of a bus.

4.2.1 Analog issues of bussed transmission lines

The  electrical  transmission  of  signals  on  a  bus  line  is  an  analog  phenomenon,  although  the 
digital  abstraction  of  logic  design  tries  to  hide  the  analog  nature  of  signal  transmission.  The 
nature  of  signal  transmission  on  a  bus  line  includes  the  propagation  speed  of  signals,  reflections 
of signals, superposition of wave forms, voltage glitches and spikes, and signal attenuation,
among  others.  Here  we  briefly  discuss  some  of  these  phenomena. 

The  propagation  of  signals  on  a  bussed  transmission  line  is  a  time-consuming  rather  than 
an  instantaneous  event.  The  speed  of  signal  propagation  on  a  bus  is  determined  by  various 
physical  and  geometrical  properties,  such  as  the  material,  shape,  temperature,  and  electrical 
properties  of  the  bus  in  its  environment.  The  length  of  a  bus  line  determines  the  maximal 
duration  that  a  signal  needs  to  propagate  through  the  bus,  which  is  termed  bus-propagation 
delay.  However,  there  are  other  factors  that  affect  the  validity  of  digital  signals  that  propagate 
on  a  bus  line,  thereby  affecting  the  propagation  speed  of  digital  signals  on  the  bus. 

A bus has a characteristic impedance that depends on its geometrical and physical properties.
This characteristic impedance is computed in terms of the inductance, capacitance, and length
of  the  bus  (see  [5,  40]).  Impedance  discontinuities  along  the  bus,  such  as  at  connectors  or  at  its 
ends,  cause  reflections  of  a  fraction  of  each  wave  form  passing  through  them.  Reflected  signals 
generate  standing  waves  and  noise  on  the  bus  line,  which  complicate  the  transfer  of  digital  data. 
Signal  reflections  and  termination  can  be  considerably  reduced  by  careful  engineering  of  the 
bus  and  its  connectors,  but  such  fine  tuning  is  rather  complex  and  expensive. 

A  transmission  line  can  simultaneously  propagate  numerous  wave  forms  at  different  locations 
and  in  either  direction.  Different  wave  forms  pass  through  each  other  without  interference  to 
create  the  spatial  and  temporal  sum  of  the  propagating  wave  forms.  This  phenomenon  is  known 
as  the  superposition  principle.  Superposition  of  valid  digital  signals  may  cause  non-valid  digital 
voltage  levels  at  various  places  on  a  bus.  The  effect  of  superposition  of  signals  is  especially 
problematic  with  open-collector  bus  drivers,  where  several  signals,  applied  by  different  modules, 
may  be  traveling  on  the  bus  in  different  directions.  A  discussion  of  wired-OR  glitches,  which 
result from superposition of signals on open-collector busses, appears in [42].

The  number  of  modules  connected  to  a  bus  line  and  the  distances  between  modules  play 
an important role in the propagation of signals on the bus. Electrical signals traveling on a
bus line experience some attenuation, which depends on the distance traveled and the driver’s
power.  If  several  modules  drive  the  bus  to  the  same  logic  level,  the  bus  may  reach  this  level 
faster  than  if  only  one  module  drives  the  bus.  In  addition,  the  length  of  the  bus  and  the  number 
of  modules  on  it  determine  the  power  at  which  modules  should  drive  electrical  signals  onto  the 
bus  to  guarantee  that  the  signals  driven  are  at  valid  digital  levels. 

As  a  consequence  of  all  the  analog  complications  in  driving  digital  signals  onto  bus  lines, 
most  bus  systems  strive  for  engineering  simplicity  at  the  cost  of  reduced  bus  performance.  In 
Chapter  3  we  discussed  a  bus  model  that  assumes  that  the  voltage  level  on  a  bus  may  not 
be a valid digital value before a unit of bus-settling delay, $T_{bus}$, passes. In this chapter we
introduce  another  bus  model,  the  digital  transmission  line  model,  which  attempts  to  capture 
the  transient  nature  of  traveling  digital  signals  on  a  bus  line  and  ignores  the  analog  phenomena 
of  signal  reflections,  waveform  superposition,  and  voltage  glitches  and  spikes.  Very  careful 
design  and  engineering  of  a  bus  can  reduce  much  of  the  analog  phenomena  on  transmission 
lines  with  the  exception  of  the  finite  propagation  speed  of  signals. 

4.2.2  The  digital  transmission  line  bus  model 

The  digital  transmission  line  model  accounts  for  propagation  delays  of  digital  signals  along 
bus  lines,  which  depend  on  the  distances  and  the  directions  that  signals  travel.  This  model 
abstracts  over  the  analog  nature  of  reflected,  superposed,  and  attenuated  signals,  by  assuming 
that  the  propagating  signals  are  always  valid  digital  signals.  The  digital  transmission  line  model 
is  a  model  of  an  idealized  bus,  which  enables  examining  certain  inherent  properties  of  bussed 
systems  (see  for  example  [3,  23,  79,  80,  81,  87]).  A  careful  design  of  high-speed  bus  lines  can 
result  in  a  good  approximation  to  this  idealized  model  (see  [5,  17,  81]). 

In  the  digital  transmission  line  bus  model,  we  make  the  following  assumptions.  The  system 
consists  of  n  modules  that  are  arranged  along  m  parallel  bus  lines.  The  m  bus  lines  all  have 
the  same  length  L.  Each  of  the  n  modules  is  connected  to  all  the  m  bus  lines  at  the  same 
spatial  location,  that  is,  at  the  same  distance  from  the  beginning  of  each  bus  line.  Under  these 
assumptions,  the  distance  between  two  modules  on  the  bus  lines  is  well  defined;  it  is  the  distance 
between  the  modules  as  measured  on  any  of  the  m  bus  lines.  There  is  a  module  at  each  of 
the two ends of the bus lines, such that the distance between the two furthest away modules is
exactly L and no other two modules are at distance L from each other.

Each  module  can  drive  digital  signals  on  any  of  the  m  bus  lines.  All  the  m  busses  have  the 
same  signal  propagation  speed,  which  we  denote  by  V.  Signals  driven  on  a  bus  line  propagate 
at  the  same  speed  and  in  both  directions  on  the  bus.  A  signal  that  a  module  drives  on  any  bus 
line, therefore, cannot be noticed at distance d away from that module before time t = d/V
passes. The time it takes for a signal to travel the whole length L of a bus line is $T_p = L/V$
and is termed the bus-propagation delay. For simplicity, we assume that the signal propagation
speed V is V = 1. This enables identifying a distance d on the bus lines with the time t that it
takes for a signal to travel this distance d, since t = d/V = d. With this assumption we also have
that $T_p = L$.

In the digital transmission line model, we assume that signal propagation on bus lines is
a  digital  phenomenon  that  exhibits  no  analog  behavior.  There  are  no  reflections  of  signals  or 
of  fractions  of  signals  anywhere  on  the  bus  lines.  The  bus  lines  are  terminated  properly  and 
signals  reaching  either  end  of  a  bus  line  simply  disappear.  Digital  signals  that  meet  on  a  bus 
line  superpose  in  a  logic  OR  manner  according  to  the  wired-OR  nature  of  the  bus  medium,  that 
is,  at  any  given  point  on  a  bus  line  the  resultant  level  measured  is  always  the  logic  OR  of  the 
digital  signals  passing  there.  No  signal  spikes,  glitches,  or  attenuation  are  experienced;  signals 
are  always  at  valid  digital  levels.  Signals  on  parallel  bus  lines  do  not  interfere  with  each  other, 
that  is,  there  is  no  “cross  talk”  between  bus  lines.  Finally,  we  assume  that  modules  do  not 
experience  any  gate  delays  in  driving  signals  on  bus  lines;  the  only  delays  considered  are  the 
propagation  delays  of  digital  signals  along  bus  lines.  In  spite  of  its  abstract  characterization 
of  bus  lines,  the  digital  transmission  line  model  is  a  useful  tool  for  investigating  the  effects  of 
signal  propagation  delays  on  the  performance  of  various  protocols. 
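The assumptions of the model are simple enough to state as code. The sketch below is an illustration under assumptions that are not in the text: each driver is described by a hypothetical triple (position, start, end), meaning that the module at that position drives a logic 1 during the half-open time interval [start, end), and the function name `bus_level` is invented for this example.

```python
def bus_level(drivers, x, t, speed=1.0):
    """Logic level seen at position x and time t on one bus line.

    A 1 emitted at distance dist from x is seen dist / speed later, in either
    direction, and signals that meet combine as a wired-OR. Reflections,
    attenuation, and gate delays are ignored, as in the model.
    """
    for position, start, end in drivers:
        emitted_at = t - abs(x - position) / speed
        if start <= emitted_at < end:
            return 1
    return 0


# One module at the left end (position 0) drives a 1 during [0, 1), with V = 1.
drivers = [(0.0, 0.0, 1.0)]
print(bus_level(drivers, x=0.5, t=0.25))  # the signal has not arrived yet
print(bus_level(drivers, x=0.5, t=0.75))  # the signal is passing
print(bus_level(drivers, x=0.5, t=1.75))  # the signal has already passed
```

Adding a second driver to the list shows the wired-OR behavior: the observed level is 1 whenever at least one of the passing signals is a 1.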

4.3  General  arrangements  of  modules 

In  this  section,  we  investigate  the  arbitration  time  of  the  binary  arbitration  scheme  for  general 
arrangements  of  modules.  A  widely  held  misconception  is  that  in  the  digital  transmission 
line  model  the  arbitration  time  of  binary  arbitration  is  at  most  4  units  of  bus-propagation 
delay.  Here,  we  formally  disprove  this  conjecture  by  demonstrating  that  the  arbitration  time 
of the binary arbitration scheme depends on the arrangement of the arbitrating modules in
the system. We first provide a scenario of module arrangement on m busses, for which binary
arbitration  takes  at  least  m/2  units  of  bus-propagation  delay  to  settle.  We  then  prove  that  for 
any  arrangement  of  modules  on  m  busses,  binary  arbitration  stabilizes  after  at  most  m/2  +  2 
units  of  bus-propagation  delay.  Finally,  we  relate  these  results  to  the  binomial  arbitration 
scheme  and  demonstrate  that  it  settles  in  at  most  m/4  +  2  units  of  bus-propagation  delay. 

4.3.1  Lower  bound  for  binary  arbitration 

To  prove  the  lower  bound  on  the  arbitration  time  of  binary  arbitration  with  m  bus  lines  in  the 
digital  transmission  line  model,  we  describe  a  scenario  for  arranging  a  selected  set  of  arbitrating 
modules  on  the  m  bus  lines.  We  assume  that  all  the  arbitrating  modules  start  their  arbitration 
process  simultaneously  and  follow  the  binary  arbitration  protocol,  which  is  described  in  Section 
3.3.2. Recall that this protocol states that each module applies its arbitration priority to
the  m  bus  lines,  and  that  if  a  module  applies  a  logic  0  to  a  certain  bus  line  but  detects  that 
the  bus  line  carries  a  logic  1,  then  the  module  disables  all  its  bits  of  lower  significance  for  as 
long  as  the  conflict  on  that  bus  line  remains.  This  rule  guarantees  that  after  sufficient  delay 
only  the  bits  of  the  highest  arbitration  priority  are  applied  to  the  m  bus  lines.  Until  this  time 
delay  passes,  however,  there  may  be  many  modules  applying  and  disabling  low-order  bits,  which 
may  generate  many  transient  digital  signals  on  the  bus  lines.  The  system  stabilizes  when  all 
the  transient  signals  on  all  the  bus  lines  have  disappeared.  Our  lower  bound  scenario  arranges 
selected  modules  on  the  m  bus  lines  in  such  a  way  that  there  is  a  sequence  of  m/2  transient 
signals,  each  of  which  is  stimulated  by  its  predecessor  in  the  sequence,  that  travel  from  side  to 
side on the m bus lines. This has the effect of delaying system settlement until at least m/2
units of bus-propagation delay pass.
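The protocol recalled above can be simulated directly in a discretized version of the digital transmission line model. The following is a sketch under assumptions made only for this illustration, not the construction used in the lower-bound proof: modules sit at integer positions, signals move one position per time step (V = 1), modules react with no gate delay, and bus m − 1 carries the most significant bit. The function name `simulate_binary_arbitration` is hypothetical.

```python
def simulate_binary_arbitration(positions, priorities, m):
    """Return (settled_value, settle_time) for binary arbitration on m busses.

    positions[c] is the integer location of module c and priorities[c] its
    m-bit arbitration priority. Time advances in steps of one position.
    """
    n = len(positions)
    length = max(positions) - min(positions)
    bits = [[(p >> j) & 1 for j in range(m)] for p in priorities]
    history = []  # history[tau][c][j]: value module c drives on bus j at time tau

    def level(tau, current, x, bus):
        # Wired-OR of every signal that reaches position x on this bus at time tau.
        for c in range(n):
            emitted = tau - abs(x - positions[c])
            if emitted < 0:
                continue
            drives = current if emitted == tau else history[emitted]
            if drives[c][bus]:
                return 1
        return 0

    seen_before, settle_time = None, 0
    # Bus j is stable everywhere by time (m - j) * length, so this horizon suffices.
    for tau in range(m * length + 2):
        current = [[0] * m for _ in range(n)]
        for j in range(m - 1, -1, -1):  # most significant bus first
            for c in range(n):
                # A 0 applied against a bus carrying a 1 disables all lower bits.
                conflict = any(bits[c][l] == 0 and level(tau, current, positions[c], l)
                               for l in range(j + 1, m))
                current[c][j] = 0 if conflict else bits[c][j]
        seen = [[level(tau, current, positions[c], l) for l in range(m)]
                for c in range(n)]
        if seen != seen_before:
            seen_before, settle_time = seen, tau
        history.append(current)
    value = sum(seen_before[0][l] << l for l in range(m))
    return value, settle_time
```

The returned `settle_time` is measured in position steps, so dividing it by the bus length expresses it in units of bus-propagation delay. The sketch does not by itself reproduce the lower-bound scenario, which depends on the specific priorities and spacings described in the rest of this subsection.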

Our  lower  bound  scenario  partitions  the  selected  arbitrating  modules  into  two  sets,  which 
we  shall  denote  by  A  and  B.  The  set  A  of  modules  is  located  at  the  very  far  right  end  of 
the  m  bus  lines  and  the  set  B  of  modules  is  located  at  the  very  far  left  end.  The  distances 
between  modules  inside  each  set  are  very  small  compared  to  the  distance  between  the  two  sets. 
The  distance  between  the  two  sets  (between  the  leftmost  module  in  the  right  set  A  and  the 
rightmost  module  in  the  left  set  B)  is  almost  the  whole  length  L  of  the  bus  system.  This  has 
the effect that arbitration inside each of the two sets settles much faster than even the time
required for a signal to propagate from one set to the other. (These distances and delays will
be  discussed  in  more  detail  towards  the  end  of  this  subsection.)  Figure  4-1  illustrates  this  high 
level partitioning of the selected arbitrating modules into sets A and B.

Figure 4-1: High level partitioning of the selected arbitrating modules into sets A and B. With a
parameter d (to be determined later), the total length of set A is 9d/4, the total length of set B is d/2,
and the distance between the two sets is almost L, such that d ≪ L.


Inside  each  of  the  sets  A  and  B,  modules  are  organized  in  linear  order  of  priorities,  with 
priorities  increasing  from  left  to  right  in  set  A  and  from  right  to  left  in  set  B.  Each  set  by  itself 
settles  rather  fast,  due  to  its  relatively  short  total  length.  However,  the  arbitration  priorities 
in  the  two  sets  are  selected  in  such  a  way  that  they  interact  with  each  other.  Initially,  when 
arbitration  begins,  a  special  “wave  form”  is  generated  by  modules  in  set  A  on  2  top  bus  lines 
and  is  propagated  towards  set  B.  This  special  “wave  form”  arrives  at  set  B  after  the  arbitration 
in set B has already settled and causes some temporary confusion there. As a result, a similar,
reflected,  and  shrunk-by-2  “wave  form”  is  generated  by  modules  in  set  B  on  the  next  2  bus  lines 
and  is  propagated  back  towards  set  A,  where  it  causes  a  similar  temporary  confusion.  This, 
in  turn,  results  in  a  similar,  reflected,  and  again  shrunk-by-2  “wave  form”,  which  is  generated 
by  modules  in  set  A  and  is  now  propagated  back  towards  set  B  on  the  next  2  bus  lines.  This 
ping-pong  of  “wave  forms”  lasts  for  m/2  iterations,  since  each  such  iteration  utilizes  2  distinct 
bus lines. The duration of each such iteration is almost $T_p$, since this is the time required by
any “wave form” to propagate from set A to set B or vice versa. The arbitration process of the
whole system is therefore not completed before $(m/2)T_p$ time passes.

We now describe the “ping-pong wave forms” that propagate back and forth between the
sets A and B. Each “ping-pong wave form” is the combination of two signals traveling together
at the same speed and in the same direction on two consecutive bus lines. Odd-indexed “ping-pong
wave forms” are generated by modules in set A and propagate towards set B (from right to left),
while even-indexed “ping-pong wave forms” are generated by modules in set B and propagate
towards set A (from left to right). The first “ping-pong wave form” is spontaneously generated
by set A when arbitration begins. The i-th “ping-pong wave form”, for $1 < i \le m/2$, is generated
as a result of receiving the (i − 1)st “ping-pong wave form”. In general, the i-th “ping-pong
wave form”, for $1 \le i \le m/2$, can be described as follows:

• a 1-signal of duration $2d/2^i$ on bus line $b_{m-2i}$, and

• a 0-signal of duration $4d/2^i$ on bus line $b_{m-2i-1}$.
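The halving of the wave forms and the resulting time bound are easy to tabulate. The sketch below takes the two durations to be 2d/2^i and 4d/2^i, as listed above, and charges a full $T_p$ for each of the m/2 crossings, which slightly overstates the "almost $T_p$" of the text. Both function names are hypothetical.

```python
def waveform_durations(i, d):
    """Durations (one_signal, zero_signal) of the i-th ping-pong wave form."""
    if i < 1:
        raise ValueError("wave forms are indexed from 1")
    return 2 * d / 2**i, 4 * d / 2**i


def settling_lower_bound(m, t_p=1.0):
    """Approximate time of m/2 crossings, each taking about one T_p."""
    return (m / 2) * t_p


# Each wave form is half as long as its predecessor.
for i in range(1, 5):
    print(i, waveform_durations(i, d=1.0))
```

With d = 1 the first wave form consists of a 1-signal of duration 1 and a 0-signal of duration 2, and every later wave form is half as long as the one that stimulated it.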


Figure  4-2  illustrates  the  ith  “ping-pong  wave  form”.  The  parameter  d  is  the  distance  between 
the modules generating the first “ping-pong wave form” and will be discussed later.

Figure 4-2: The i-th “ping-pong wave form” on two consecutive bus lines. This wave form propagates
from  right  to  left,  that  is,  i  is  assumed  to  be  odd.  For  even  i,  this  wave  form  should  be  reflected. 
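
As a quick illustration of the wave-form parameters listed above, the following Python sketch tabulates the two signals that make up the ith wave form. The function name `ping_pong_wave_form` and the tuple layout are illustrative choices, not part of the thesis; bus line b_j is represented by its index j, and d is treated as an exact rational.

```python
from fractions import Fraction


def ping_pong_wave_form(i, m, d):
    """Return the two signals of the i-th wave form, for 1 <= i <= m/2.

    Each signal is a tuple (value, bus_index, duration, direction).
    Note: for i = m/2 the formula gives bus index -1 for the 0-signal,
    i.e. no physical bus line; the formula is used exactly as stated.
    """
    if not 1 <= i <= m // 2:
        raise ValueError("i must satisfy 1 <= i <= m/2")
    d = Fraction(d)
    # Odd-indexed wave forms travel from set A (right) to set B (left).
    direction = "right-to-left" if i % 2 == 1 else "left-to-right"
    one_signal = (1, m - 2 * i, 2 * d / 2**i, direction)
    zero_signal = (0, m - 2 * i - 1, 4 * d / 2**i, direction)
    return one_signal, zero_signal
```

Each successive wave form is half as long as the previous one, which is why the construction needs geometrically shrinking module spacings.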


We now turn to the relative arrangement of modules inside the sets A and B, which
is responsible for the “ping-pong wave forms” phenomenon. For simplicity, we focus first on the
structure of set B, which is somewhat simpler than that of set A. The location of modules in
set B and their relative distances from each other are of primary importance. The modules in
set B are responsible for receiving the odd-indexed “ping-pong wave forms” (first, third, etc.)
coming from the right, and for generating the even-indexed “ping-pong wave forms” (second,
fourth, etc.) that propagate to the right. To examine the generation of the second “ping-pong
wave form”, for example, we need to describe the location and priorities of three modules in set
B. These three modules, with arbitration priorities $p_1$, $p_2$, and $p_3$, are illustrated in Figure 4-3.
Module $p_3$ is at the left end of the bus system, module $p_2$ is at distance d/4 from the left end
of the bus system, and module $p_1$ is at distance d/2 from the left end of the bus system. (Recall
that the parameter d is related to the duration of the first “ping-pong wave form”.)
Furthermore, module $p_1$ is the only arbitrating module (in both sets A and B) with a 1 on bus
$b_{m-4}$; no other arbitrating module has a 1-bit on this bus. The space between modules $p_1$ and $p_2$
contains no other arbitrating modules. The space between modules $p_2$ and $p_3$ may contain other
arbitrating modules for generation of future even-indexed “ping-pong wave forms”. However,
each arbitrating module in the space between $p_2$ and $p_3$ must agree with $p_2$ and $p_3$ on their
high order bits, as illustrated in Figure 4-3.

Figure 4-3: The arrangement of the three modules in set B that are responsible for receiving the first
“ping-pong wave form” and for generating the second “ping-pong wave form”. The space between $p_1$ and
$p_2$ contains no other arbitrating modules. The space between $p_2$ and $p_3$ may contain other arbitrating
modules for generation of future even-indexed “ping-pong wave forms”.

We next examine how the arrangement of the three set-B modules, illustrated in Figure
4-3, receives the first “ping-pong wave form” and generates the second “ping-pong wave form”.
We assume that the first “ping-pong wave form”, which propagates from right to left, arrives
at the location of module $p_1$ at time t. This first “ping-pong wave form” consists of 2
left-traveling signals as follows: a 1-signal of duration d on bus line $b_{m-2}$ accompanied by a
0-signal of duration 2d on bus line $b_{m-3}$ (see Figure 4-2). In the following discussion, we keep
track of right-traveling wave forms generated on bus lines $b_{m-4}$ and $b_{m-5}$, as detected at the
location of module $p_1$, starting at time t + d.


4.3.  GENERAL  ARRANGEMENTS  OF  MODULES 


We first concentrate on the wave form generated on bus line $b_{m-4}$ at the location of $p_1$
after time t + d. Notice that the left-traveling 1-signal on bus line $b_{m-2}$ (one part of the first
“ping-pong wave form”) arrives at module $p_1$ at time t, is of duration d, and thus leaves
module $p_1$ at time t + d. At time t + d/2 the leading edge of this signal arrives at module
$p_3$, and at time t + 5d/4 the trailing edge of this signal leaves module $p_2$ (see Figure 4-3).
Therefore, in the time interval (t + d/2, t + 5d/4), all modules between $p_2$ and $p_3$ disable their
bits on the bus lines below $b_{m-2}$. Specifically, this causes a right-traveling 0-signal on bus line
$b_{m-3}$, originating at module $p_3$, which arrives at module $p_1$ at time t + d. This right-traveling
0-signal on bus line $b_{m-3}$ is terminated at time t + 5d/4 at the location of module $p_2$, since the
signal on bus $b_{m-2}$ passes $p_2$ at that time. However, at the location of $p_1$, the right-traveling
0-signal on bus $b_{m-3}$ is detected until time t + 5d/4 + d/4 = t + 3d/2 (it takes time d/4 for the
change at $p_2$ to reach $p_1$). In addition, the left-traveling 0-signal on bus line $b_{m-3}$ (the other
part of the first “ping-pong wave form”) guarantees that no 1-signal arrives on this bus from
the right until time t + 2d. The result of all the above discussion is that between time t + d
and time t + 3d/2 the digital values on bus lines $b_{m-1}$ through $b_{m-3}$ at the location of $p_1$ agree
with the bits of priority $p_1$. Consequently, module $p_1$ generates a 1-signal on bus line $b_{m-4}$ in
the time interval (t + d, t + 3d/2), which propagates both left and right and is of duration d/2.
The right-traveling portion of this signal is one part of the second “ping-pong wave form”.

We now concentrate on the wave form generated on bus line $b_{m-5}$ at the location of $p_1$ after
time t + d. The discussion in the previous paragraph about the right-traveling 0-signal on bus
line $b_{m-3}$ is also applicable to bus line $b_{m-5}$, since the modules between $p_2$ and $p_3$ disable all
their bits below bus $b_{m-2}$. Therefore, there is a right-traveling 0-signal on bus $b_{m-5}$ between
time t + d and time t + 3d/2. However, the 1-signal on bus $b_{m-4}$, generated by module $p_1$
between time t + d and t + 3d/2, propagates both to the left and to the right. The left-traveling
portion of this 1-signal on bus $b_{m-4}$ arrives at modules to the left of $p_1$ just as the left-traveling
1-signal on bus $b_{m-2}$ leaves those modules. Consequently, modules to the left of $p_1$ continue
to disable their bits on bus $b_{m-5}$ for at least another d/2 time, which is the duration of the
1-signal that module $p_1$ generates. As a result, we have a right-traveling 0-signal on bus line
$b_{m-5}$ in the time interval (t + d, t + 2d), which is the other part of the second “ping-pong wave
form”. The right-traveling signals on bus lines $b_{m-4}$ and $b_{m-5}$ leave set B at time t + d on their
way towards set A, where a similar, reflected, and shrunk-by-2 process occurs.
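
The timing argument for set B can be replayed mechanically. The sketch below is only a bookkeeping aid under stated assumptions: signals travel at unit speed (so distances and times share units, as in the text), and the three modules sit at distances 0, d/4, and d/2 from the left end. The function and key names are hypothetical.

```python
from fractions import Fraction


def second_wave_form_timeline(t, d):
    """Key event times when the first wave form reaches p1 at time t."""
    t, d = Fraction(t), Fraction(d)
    # Distances from the left end of the bus (assumed placement).
    pos = {"p3": Fraction(0), "p2": d / 4, "p1": d / 2}
    # Left-traveling 1-signal on b_{m-2}: reaches p1 at t, lasts d.
    lead_at_p3 = t + (pos["p1"] - pos["p3"])
    trail_at_p2 = t + d + (pos["p1"] - pos["p2"])
    # Right-traveling 0-signal on b_{m-3}, as observed at p1.
    zero_start = lead_at_p3 + (pos["p1"] - pos["p3"])
    zero_end = trail_at_p2 + (pos["p1"] - pos["p2"])
    # p1 drives a 1 on b_{m-4} exactly while that 0-signal is observed.
    one_interval = (zero_start, zero_end)
    # The 0 on b_{m-5} is extended by the duration of p1's 1-signal.
    zero_interval = (zero_start, zero_end + (zero_end - zero_start))
    return {
        "lead_at_p3": lead_at_p3,
        "trail_at_p2": trail_at_p2,
        "one_on_b_m_minus_4": one_interval,
        "zero_on_b_m_minus_5": zero_interval,
    }
```

With t = 0 and d = 4 this reproduces the intervals (t + d, t + 3d/2) and (t + d, t + 2d) derived above.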


The structure of set A is almost identical to that of set B. The only difference is that the
first “ping-pong wave form” is spontaneously created by modules in set A when the arbitration
process begins. Figure 4-4 illustrates the five set-A modules that are responsible for the first
and the third “ping-pong wave forms”. Modules $p_4$, $p_5$, and $p_6$ spontaneously create the first
“ping-pong wave form” on bus lines $b_{m-2}$ and $b_{m-3}$. To see this, we concentrate on the
left-propagating wave forms detected at the location of module $p_4$ immediately after arbitration
starts. When arbitration begins, module $p_4$ generates a 1-signal on bus line $b_{m-2}$ for a duration
of d, since after time d the 1-signal that module $p_5$ generates on bus line $b_{m-1}$ disables module
$p_4$ forever. Also, when arbitration begins, bus line $b_{m-3}$ at the location of $p_4$ carries a 0-signal
for a duration of 2d, until the 1-signal from module $p_6$ arrives from the right at the location of
module $p_4$. The combination of the signals on bus lines $b_{m-2}$ and $b_{m-3}$ is the first “ping-pong
wave form” that propagates towards set B. Modules $p_6$, $p_7$, and $p_8$ are responsible for the third
“ping-pong wave form” on bus lines $b_{m-6}$ and $b_{m-7}$. The arrangement of these modules is a
shrunk-by-2 mirror image of the arrangement of set B.

Figure 4-4: The arrangement of the five modules in set A that are responsible for creating the first
“ping-pong wave form”, and for receiving the second “ping-pong wave form” and generating the third
“ping-pong wave form”. The spaces between $p_4$ and $p_5$, between $p_5$ and $p_6$, and between $p_6$ and $p_7$
contain no other arbitrating modules. The space between $p_7$ and $p_8$ may contain other arbitrating
modules for generation of future odd-indexed “ping-pong wave forms”.

The scenario for module priorities and placement continues in a recursive fashion. For
example, the region in set B that is responsible for the fourth “ping-pong wave form” is a
shrunk-by-4 image of the modules in Figure 4-3. The three new modules are placed in a total
space of d/8 from the left end of the bus lines, with the leftmost module of the three coinciding
with the module already there. The leftmost module on the bus lines thus has the string
$(10)^{m/2}$ as its arbitration priority. Formally, for the generation of the (2k)th “ping-pong
wave form”, we place a module with priority $(10)^{2k-1}1010^{m-4k-1}$ at distance $d/2^{2k}$ from the
left end, and another module with priority $(10)^{2k-1}010^{m-4k}$ at distance $2d/2^{2k}$ from the left
end. Similar recursion is applied to the structure of the right set A.
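
To make the recursive placement concrete, the following sketch generates the two set-B modules for the (2k)th wave form, taking the priorities to be $(10)^{2k-1}1010^{m-4k-1}$ and $(10)^{2k-1}010^{m-4k}$. This is an illustration only: the function name is hypothetical, positions are expressed as fractions of d, and the sketch simply rejects values of k for which the exponents would be negative.

```python
from fractions import Fraction


def set_b_modules(m, k):
    """Priorities and positions (as fractions of d) for the (2k)-th wave form."""
    if k < 1 or m - 4 * k - 1 < 0:
        raise ValueError("need k >= 1 and m >= 4k + 1")
    prefix = "10" * (2 * k - 1)
    # Module at distance d / 2^(2k) from the left end.
    near = prefix + "101" + "0" * (m - 4 * k - 1)
    # Module at distance 2d / 2^(2k) from the left end.
    far = prefix + "01" + "0" * (m - 4 * k)
    scale = 2 ** (2 * k)
    return [(near, Fraction(1, scale)), (far, Fraction(2, scale))]


def leftmost_priority(m):
    """Priority (10)^(m/2) of the module at the left end, for even m."""
    return "10" * (m // 2)
```

For m = 8 and k = 1 this yields the priorities 10101000 at d/4 and 10010000 at d/2, which match the roles of the second and first modules of Figure 4-3: only the latter has a 1 on bus line $b_{m-4}$.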

We  now  discuss  the  design  parameters  d,  L,  Tp,  and  the  duration  of  the  arbitration  process. 
The  parameter  d  is  the  spacing  between  the  modules  that  generate  the  first  “ping-pong  wave 
form”,  and  the  parameter  L  is  the  length  of  a  bus  line.  The  total  length  occupied  by  the  two 
sets  A  and  B  combined  is  no  more  than  3d,  which  leaves  a  distance  of  at  least  L  -  3d  between 
the  two  sets  for  “ping-pong  wave  forms”  to  travel  back  and  forth.  The  arbitration  scenario, 
thus, consists of m/2 iterations, each of which takes at least (L - 3d)/L units of bus-propagation
delay. To maximize the arbitration time, we need to minimize the value of d. If the system
design is such that there is no lower limit on the distance between modules, then d could be
made as small as desired and the arbitration process would take m/2 units of Tp. If, however,
modules  are  required  to  be  equally  spaced  on  the  bus  lines,  then  the  following  analysis  shows 
that  the  lower  bound  of  m/2  units  of  Tp  is  asymptotically  attainable. 

Suppose that $\Delta$ is the spacing between any two consecutive modules on the bus lines. To
enable m/2 iterations of the lower bound scenario, the duration of the last “ping-pong wave
form” must be at least $\Delta$. Alternatively, we must have $\Delta = 2^{-(m/2-1)} d$, or $d = 2^{m/2-1}\Delta$.
However, on m bus lines there are $2^m$ modules, which implies $L = (2^m - 1)\Delta$. The ratio
$(L - 3d)/L$ is then at least $1 - 1/2^{m/2-2}$, which approaches 1 as m increases. This indicates
that asymptotically almost the full length of the bus lines is traveled in each iteration. We
summarize  this  discussion  in  the  following  theorem. 

Theorem  37  There  is  a  scenario  of  module  arrangement  on  m  bus  lines,  such  that  under  the 
digital  transmission  line  bus  model,  the  binary  arbitration  scheme  asymptotically  requires  at 
least  m/2  units  of  bus-propagation  delay  to  settle. 
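
The asymptotic claim behind the theorem can be checked numerically. The sketch below assumes equally spaced modules with spacing 1, $d = 2^{m/2-1}$, and $L = 2^m - 1$, and compares the exact ratio (L - 3d)/L with the stated lower bound $1 - 1/2^{m/2-2}$. The function names are illustrative.

```python
from fractions import Fraction


def travel_ratio(m):
    """Exact (L - 3d)/L for equally spaced modules, spacing taken as 1."""
    if m < 2 or m % 2:
        raise ValueError("m must be an even integer >= 2")
    d = Fraction(2) ** (m // 2 - 1)   # d = 2^(m/2 - 1) * spacing
    L = Fraction(2**m - 1)            # 2^m modules, 2^m - 1 gaps
    return (L - 3 * d) / L


def stated_lower_bound(m):
    """The bound 1 - 1/2^(m/2 - 2) quoted in the text."""
    return 1 - 1 / Fraction(2) ** (m // 2 - 2)
```

For m = 8 the exact ratio is 77/85 (about 0.906) against a stated bound of 3/4, and the ratio tends to 1 as m grows.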

4.3.2 Upper bound for binary arbitration

In  this  subsection,  we  prove  that  for  any  arbitrary  arrangement  of  modules  on  m  busses,  binary 
arbitration  stabilizes  after  at  most  m/2  +  2  units  of  bus-propagation  delay.  This  upper  bound 
is  derived  by  concentrating  on  the  number  of  0-intervals  in  the  highest  competing  arbitration 
priority  and  on  the  relative  locations  of  arbitrating  modules.  We  first  define  the  number  of 
0-intervals  of  a  codeword. 

Definition  16  The  number  of  0-intervals  of  a  binary  codeword  p  is  the  number  of  intervals  of 
consecutive  0’s  that  p  contains,  disregarding  the  leading  0’s. 
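
A minimal sketch of Definition 16 follows, with codewords given as strings of '0' and '1' characters. The function name is an illustrative choice.

```python
import re


def count_zero_intervals(codeword):
    """Number of maximal runs of 0's in codeword, ignoring leading 0's."""
    if not codeword or set(codeword) - {"0", "1"}:
        raise ValueError("codeword must be a non-empty binary string")
    # Drop the leading 0's, then count each maximal run of 0's once.
    return len(re.findall(r"0+", codeword.lstrip("0")))
```

For example, the codeword 0011010 has two 0-intervals, while an alternating codeword such as 10101010 has four, one per "10" pair.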

The nature of the binary arbitration protocol is such that an interval of consecutive identical
bits of a codeword can be regarded as a basic unit. For an interval of consecutive 1’s this is
the case, since such an interval cannot be interrupted in the middle (there is no 0-bit there where
1-signals can penetrate). An interval of consecutive 1’s is thus either applied as one unit or
entirely disabled. An interval of consecutive 0’s can be interrupted in the middle, but then it
has the effect of disabling all the bits below that interval, no matter where inside the interval
the interruption occurs. In a binary arbitration process, the number r of 0-intervals of the
highest competing priority is related to the arbitration time, as the next theorem implies. The
theorem also relates the arbitration time to L, the length of the bus lines. This connection is
rather  important,  as  the  proof  relies  on  the  fact  that  arbitration  among  modules  that  are  close 
on  the  bus  lines  terminates  faster  than  among  far  away  modules. 

Theorem 38 Consider a binary arbitration process on m bus lines of length L under the digital
transmission line bus model. Let Q be the set of arbitrating priorities, p be the highest priority
in Q, and r be the number of 0-intervals of p. Then the arbitration process settles after at most
(r + 2)L time, that is, there are no more transient signals on any bus line after time t = (r + 2)L.

Proof. Since the number of 0-intervals of the highest competing arbitration priority p is r,
p is of the form $p = 0^{k_0} 1^{l_1} 0^{k_1} 1^{l_2} 0^{k_2} \cdots 1^{l_r} 0^{k_r} 1^{l_{r+1}}$, where $k_0 \ge 0$; $l_j, k_j > 0$ for
$1 \le j \le r$; $l_{r+1} \ge 0$; and $k_0 + l_{r+1} + \sum_{j=1}^{r} (l_j + k_j) = m$. In the following discussion, we
ignore the $k_0$ leading 0’s since the first $k_0$ bus lines carry 0’s throughout the arbitration process.
For notational simplicity, we then assume that $k_0 = 0$. We now prove the theorem by induction
on r for arbitrary values of L.



Base case: r = 0. The codeword p consists of m consecutive 1’s, that is, $p = 1^m$. This interval
of m consecutive 1’s propagates together on the m bus lines, and after at most one unit of
bus-propagation delay all bus lines have settled forever. Arbitration in this case takes no more
than L time, which does not exceed (r + 2)L time.

Base case: r = 1. The codeword p has the form $p = 1^{l_1} 0^{k_1} 1^{l_2}$. The first interval of $l_1$
consecutive 1’s propagates together on the first $l_1$ bus lines, and after at most one unit of
bus-propagation delay all these $l_1$ bus lines settle to 1’s forever. As a result, any module that
has some 1-bits in the second interval of $k_1$ bus lines disables these bits after at most one
unit of bus-propagation delay. Therefore, after at most two units of bus-propagation delay, the
second interval of $k_1$ bus lines settles to 0’s forever. Consequently, after at most two units of
bus-propagation delay, module p re-enables its last interval of $l_2$ consecutive 1’s forever, which
brings the bus lines to a stable state after at most three units of bus-propagation delay. (See
Section 4.4.1 for a proof that this scenario is indeed possible.) Arbitration in this case takes no
more than 3L time, which does not exceed (r + 2)L time.

Inductive case: r > 1. The codeword p has the form $p = 1^{l_1} 0^{k_1} 1^{l_2} 0^{k_2} \cdots 1^{l_{r+1}}$. We define the
set $\hat{Q}$ of all arbitrating modules in Q that have their first $l_1 + k_1$ bits identical to those of p,
that is, $\hat{Q} = \{q \in Q : q = 1^{l_1} 0^{k_1} \cdots\}$. We focus on possible 1-signals sent by other arbitrating
modules (from $Q - \hat{Q}$) in the second interval of p, that is, the interval of $k_1$ consecutive 0’s.
There are three cases to consider:

(a) There are no arbitrating modules with 1-bits in the second interval of p. In this case the
first three intervals of p, which have the form $1^{l_1} 0^{k_1} 1^{l_2}$, behave like one uninterrupted
interval that could be replaced by an interval of $l_1 + k_1 + l_2$ consecutive 1’s with no change
in the behavior. The number of 0-intervals remaining to be considered is now r - 1. By
induction, such arbitration processes take at most ((r - 1) + 2)L < (r + 2)L time.

(b) There is an arbitrating module $q \in Q - \hat{Q}$ with a 1-bit in the second interval of p, and
there is another arbitrating module $p' \in \hat{Q}$, such that q is physically between p and p' on
the bus lines (see Figure 4-5). Let $d_1$ be the distance of q from p and let $d_2$ be the distance
of q from p'. Without loss of generality, we assume that $d_1 \le d_2$. (This also implies that
$d_1 \le L/2$, since otherwise $d_1 + d_2 > L$.) Then the 1-signal that module q generates in the
0-interval of p has duration $d_1$ (it is disabled by module p after time $d_1$). This 1-signal
completely disappears from the system after at most one unit of bus-propagation delay,
since both its right-traveling and left-traveling portions go over the corresponding ends of
the bus lines after at most time L. Consequently, not only is module q disabled after one
unit of bus-propagation delay, but also the effects that it caused in the system are gone
and the modules of $\hat{Q}$ are re-enabled after that time. By induction now, the modules of
$\hat{Q}$ complete the arbitration after at most ((r - 1) + 2)L time, which with the extra unit
of time L for disabling modules like q gives an arbitration time of at most (r + 2)L.


Figure 4-5: An interrupting module q on the first 0-interval of the highest arbitration priority p. There
is another module p' with the same first two intervals as p on the other side of q.


(c) There is an arbitrating module $q \in Q - \hat{Q}$ with 1-bits in the second interval of p, and
all the modules in $\hat{Q}$ are on the same side of q (see Figure 4-6). Let p' be the module
in $\hat{Q}$ that is closest to q and let d be the distance between p' and q. The 1-signal that
module q generates in the 0-interval of p has duration d, since it is disabled by module
p' after time d. However, this 1-signal may take another time L to completely disappear
from the system, since it may be the case that module q is at the very end of the bus
lines. Therefore, after at most d + L time the effects that modules like q cause are gone
and the modules of $\hat{Q}$ are re-enabled after that time. Notice, however, that the modules
of $\hat{Q}$ have cleared the first 0-interval of p, so that there are r - 1 more 0-intervals of p
to consider. In addition, notice that the modules of $\hat{Q}$ are located in a bus-region of
length at most L - d. By induction now, the modules of $\hat{Q}$, on the reduced region of the
bus lines, complete the arbitration after at most ((r - 1) + 2)(L - d) = rL - rd + L - d
time. To this time we need to add the extra d + L time required to eliminate modules
like q. Finally, we add another d units of time to allow the final signals of p to propagate
beyond the region of length L - d and to cover the full length of the bus lines. The total
time is, therefore, (rL - rd + L - d) + (d + L) + d = rL - rd + 2L + d = (r + 2)L - (r - 1)d,
which does not exceed (r + 2)L since r > 1, as required.

Figure 4-6: An interrupting module q on the first 0-interval of the highest arbitration priority p. All
the arbitrating modules with the same first two intervals as p are on the same side of q.


We conclude that any binary arbitration process on bus lines of length L, with p, the highest
competing arbitration priority, having r 0-intervals, completes after at most (r + 2)L time. □
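
The time accounting in case (c) of the proof can be checked by summing its three contributions directly. Doing so gives (r + 2)L - (r - 1)d, which is at most (r + 2)L whenever r ≥ 1, so the bound claimed by the theorem holds. The sketch below is an arithmetic check only; the function name is hypothetical.

```python
from fractions import Fraction


def case_c_total(r, L, d):
    """Sum of the three time contributions in case (c) of the proof."""
    L, d = Fraction(L), Fraction(d)
    # Inductive bound on a bus region of length L - d with r - 1 intervals.
    inductive = ((r - 1) + 2) * (L - d)
    # Time for the 1-signal of q to be cut off and leave the bus lines.
    eliminate_q = d + L
    # Time for the final signals of p to cover the remaining length d.
    final_spread = d
    return inductive + eliminate_q + final_spread
```

For instance, with r = 2, L = 1, and d = 1/4 the total is 15/4, just under the bound (r + 2)L = 4.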

Theorem  38  bounds  the  arbitration  time  of  any  binary  arbitration  process  by  (r  +  2)L,  where 
r  is  the  number  of  0-intervals  in  the  highest  arbitrating  priority  and  L  is  the  length  of  the  bus 
lines.  With  m  bus  lines  to  arbitrate,  the  number  of  0-intervals  of  any  arbitration  priority  is  no 
more than m/2. In addition, we assume that L = Tp, where Tp is the bus-propagation delay.
These  observations  imply  the  following  corollary. 

Corollary 39 For any binary arbitration process on m bus lines under the digital transmission
line bus model, arbitration settles in at most m/2 + 2 units of bus-propagation delay.


4.3.3  Lower  and  upper  bounds  for  binomial  arbitration 

Binomial  arbitration  uses  the  same  arbitration  protocol  as  binary  arbitration.  The  results  of 
the  preceding  subsections,  which  provided  lower  and  upper  bounds  on  the  arbitration  time  of 
binary  arbitration,  are,  therefore,  directly  applicable  to  binomial  arbitration  as  well. 

A lower bound scenario, similar to that of Theorem 37, can be applied to the binomial
arbitration scheme. The only difference is that the binomial arbitration priorities have no
more than m/4 0-intervals, where “ping-pong wave forms” can penetrate and cause temporary
confusion. This implies the following corollary.

Corollary  40  There  is  a  scenario  of  module  arrangement  on  m  bus  lines,  such  that  under  the 
digital  transmission  line  bus  model,  the  binomial  arbitration  scheme  asymptotically  requires  at 
least  m/4  units  of  bus-propagation  delay  to  settle. 

The  upper  bound  for  binomial  arbitration  is  derivable  from  Theorem  38.  Since  for  binomial 
arbitration  on  m  bus  lines  any  arbitration  priority  has  at  most  m/2  intervals,  the  number  of 
0-intervals in any priority is no more than m/4. This implies the following corollary.

Corollary 41 For any binomial arbitration process on m bus lines under the digital transmission
line bus model, arbitration settles in at most m/4 + 2 units of bus-propagation delay.
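
For a side-by-side view of the bounds of Section 4.3, the sketch below tabulates them for a given number m of bus lines. It assumes both schemes are compared on the same m; the lower bounds are the asymptotic worst-case scenarios of Theorem 37 and Corollary 40, not guarantees for every arrangement. The function name is illustrative.

```python
from fractions import Fraction


def settling_bounds(m):
    """(lower, upper) bounds in units of bus-propagation delay, m bus lines."""
    m = Fraction(m)
    return {
        # Worst-case scenario of m/2; upper bound of m/2 + 2.
        "binary": (m / 2, m / 2 + 2),
        # Worst-case scenario of m/4; upper bound of m/4 + 2.
        "binomial": (m / 4, m / 4 + 2),
    }
```

With m = 16, binary arbitration is bounded between 8 and 10 units, while binomial arbitration is bounded between 4 and 6 units.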


4.4  Linear  arrangements  of  modules 

In  this  section  we  examine  linear  arrangements  of  modules  in  increasing  order  of  priorities  with 
the  modules  equally  spaced  on  the  bus  lines.  For  such  arrangements,  we  show  that  3  units  of 
bus-propagation  delay  are  necessary  for  binary  arbitration  to  settle.  We  also  sketch  an  argument 
that  indicates  that  3  units  of  bus-propagation  delay,  rather  than  the  4  claimed  in  [79,  81],  are 
asymptotically  sufficient  for  binary  arbitration. 

4.4.1  Lower  bound  for  binary  arbitration 

To demonstrate a lower bound of 3 units of bus-propagation delay on the arbitration time of
binary arbitration, we present an arrangement of two modules as in Figure 4-7. The arbitration
priority p of the module on the left side is $1^{m-2}01$ and the arbitration priority q of the module
on the right side is $0^{m-2}10$. We use d to denote the distance between modules p and q. When
arbitration begins, module q sends its 1-bit towards module p during the time interval (0, d),
since after time d the high order bits of p disable the 1-bit of q. At the location of p, the
1-signal on bus $b_1$ is detected during the time interval (d, 2d), which causes p to disable its
last 1-bit during that time interval. Only after time 2d does module p re-enable its last 1-bit,
but it takes slightly more than time d for this change to propagate throughout the system.
The arbitration time, therefore, is at least 3d. The reader may verify that if $\Delta$ is the distance
between consecutive modules on the bus lines, then the distance between modules p and q in
Figure 4-7 is $d = (2^m - 5)\Delta$. The total length of the bus lines is $L = (2^m - 1)\Delta$, and thus the
ratio d/L asymptotically approaches 1. This shows that the arbitration time of 3d approaches
3 units of bus-propagation delay asymptotically, as m increases.


Figure  4-7:  Linear  arrangement  of  2  modules  very  close  to  the  two  ends  of  the  bus  lines.  The  arbitration 
process  on  this  arrangement  asymptotically  takes  3  units  of  bus-propagation  delay. 
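
The distance computation left to the reader can be sketched as follows, assuming that a module with priority value v sits at position v (in units of the spacing between consecutive modules) and that the priorities are $1^{m-2}01$ and $0^{m-2}10$. The function name is hypothetical.

```python
from fractions import Fraction


def linear_lower_bound(m):
    """3d/L, the lower bound in units of bus-propagation delay, for m >= 3."""
    if m < 3:
        raise ValueError("m must be at least 3")
    p = int("1" * (m - 2) + "01", 2)   # priority 1^(m-2) 0 1
    q = int("0" * (m - 2) + "10", 2)   # priority 0^(m-2) 1 0
    d = p - q                          # equals 2^m - 5 spacings
    L = 2**m - 1                       # 2^m modules, 2^m - 1 gaps
    return Fraction(3 * d, L)
```

For m = 4 the bound is 11/5 units; it increases with m and stays strictly below 3.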

4.4.2  Upper  bound  for  binary  arbitration 

We  now  sketch  an  argument  that  indicates  that  the  arbitration  time  of  binary  arbitration  can 
be  shown  to  be  close  to  3  units  of  bus-propagation  delay.  The  argument  involves  inspection  of 
several cases and only a high-level description of it is presented here. With m bus lines there
are $2^m$ modules and the total length of the bus system is L. We partition the modules into
$2^k$ subregions, each of length $L/2^k$, according to the first k bits of their arbitration priorities.
By inspecting each of the $2^k$ subregions, one can verify that if the highest priority is in a
given subregion, then after at most 2 units of bus-propagation delay all the possible transient
signals sent by modules in lower-priority subregions have disappeared. This leaves only the
subregion under inspection, whose length is $L/2^k$, for the rest of the arbitration, which we shall
analyze recursively. In addition, after the arbitration is completed on the inspected subregion
of length $L/2^k$, at most another $1 - 1/2^k$ units of bus-propagation delay are required for the
bit-signals of the highest priority to spread throughout the bus lines. If we let T(n) denote the
maximal time required by binary arbitration on n modules, then we get the following recurrence:
$T(n) = 2 + T(n/2^k) + (1 - 1/2^k)$, where the recursive term is measured on a subregion of length
$L/2^k$ and therefore contributes a $1/2^k$ fraction of the full-length time. This solves to give
$T(n) = 3 + 2/(2^k - 1)$. Now, as k increases (there are more cases to inspect), the maximal
arbitration time can be shown to asymptotically approach 3 units of bus-propagation delay.
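
The closed form can be checked with exact arithmetic. The sketch below assumes that, in units of the full bus-propagation delay, the recursive term on a subregion of length $L/2^k$ contributes T/2^k, so that the recurrence becomes the fixed-point equation T = 2 + T/2^k + (1 - 1/2^k). Function names are illustrative.

```python
from fractions import Fraction


def solve_recurrence(k):
    """Fixed point of T = 2 + T/2^k + (1 - 1/2^k), for k >= 1."""
    s = Fraction(1, 2**k)
    return (3 - s) / (1 - s)


def iterate_recurrence(k, steps=60, start=0):
    """Apply the recurrence repeatedly, starting from an arbitrary value."""
    s = Fraction(1, 2**k)
    T = Fraction(start)
    for _ in range(steps):
        T = 2 + s * T + (1 - s)
    return T
```

The fixed point equals 3 + 2/(2^k - 1): it is 5 for k = 1, 11/3 for k = 2, and it approaches 3 as k grows.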

4.5  Discussion  and  extensions 

In  this  section,  we  discuss  the  results  of  this  chapter  and  indicate  directions  for  further  research. 

4.5.1  Discussion 

In this chapter, we investigated how the finite propagation speed of signals on bussed
transmission lines affects the performance of the priority arbitration schemes of Chapter 3. We formally
disproved Taub’s conjecture by providing a general scenario of module arrangement on m busses,
for which binary arbitration takes at least m/2 units of bus-propagation delay. We also proved
that for any arrangement of modules on m busses, binary arbitration settles in at most m/2 + 2
units of bus-propagation delay, while binomial arbitration settles in at most m/4 + 2 units of
bus-propagation delay. This demonstrates the superiority of binomial arbitration for general
arrangements of modules under the digital transmission line model. For linear arrangements of
modules in increasing order of priorities and equal spacings between modules, we showed that
3 units of bus-propagation delay are necessary for binary arbitration to settle, and we indicated
that 3 units of bus-propagation delay are also asymptotically sufficient. System designers and
engineers  may  wish  to  reconsider  the  use  of  Taub’s  assumptions  and  analyses,  since  different 
arrangements  of  system  modules  exhibit  substantially  different  behavior. 

4.5.2  Further  research 

Several  directions  for  extending  the  results  of  this  chapter  are  listed. 

•  Average-case  arbitration  time  of  binary  arbitration  for  arbitrary  and  linear  arrangements. 

•  Linear  arrangements  of  modules  with  arbitrary  spacings  between  modules. 

•  The  performance  of  binomial  arbitration  for  linear  arrangements  of  modules. 

•  Models  of  bussed  transmission  lines  that  characterize  other  aspects  of  the  media. 

Conclusion


Bussed interconnections are extensively used in many digital systems. Investigating the
characteristics, capabilities, and organization of bussed systems is the subject of ongoing research. In
this thesis, we focused on two application domains for busses: communication architectures and
control mechanisms, and examined the capabilities of busses as interconnection media,
computation devices, and transmission channels. This chapter presents some concluding remarks and
motivates further research on bussed interconnections, in general, and on each of the aspects of
bussed systems that this thesis explored, in particular.


5.1  Bussed  interconnections 

Busses  are  shared  communication  media.  A  single  bus  can  only  implement  one  communication 
transaction at any given time and thus constitutes a scarce resource that must be utilized
intelligently. Much research is directed at investigating techniques and mechanisms that can enrich
the  bandwidth  of  a  bus.  Several  techniques,  such  as  time  multiplexing,  frequency  multiplexing, 
spatial  multiplexing,  and  angular  multiplexing  have  been  suggested  for  some  communication 
media,  such  as  radio  channels  and  optical  communications  (see  [12,  13,  78]).  Some  of  these 
techniques  have  also  been  applied  to  electrical  busses,  but  a  more  thorough  exploration  of  bus 
multiplexing  techniques  is  required. 

Busses  enable  communication  among  several  system  modules,  in  contrast  with  point-to-point 
wires  that  establish  communication  only  between  pairs  of  modules.  This  property  of  busses 
may  or  may  not  be  desirable,  depending  on  the  application.  On  busses,  any  communication
transaction,  whether  a  one-to-one  or  a  broadcast,  can  be  detected  by  all  system  modules,  while 
point-to-point  wires  feature  privacy  of  communication.  Busses  require  sophisticated  controlling 
mechanisms  and  protocols  to  enable  sharing  and  to  support  sequencing  of  transactions,  while 
controlling  the  communication  with  point-to-point  wires  is  somewhat  more  straightforward. 
Busses,  however,  offer  simple,  standard,  and  scalable  communication  channels,  which  are  the 
desired  features  of  many  digital  systems. 

Bus  technology  is  more  complicated  than  the  technology  of  direct  communication  channels. 
Signal  propagation  on  busses  is  a  complex  phenomenon  that  is  ignored  or  poorly  dealt  with  in 
many  systems.  Bus  driving  technologies  use  special  drivers  for  transmitting  signals  along  busses. 
Most  digital  systems  employ  the  digital  abstraction  and  ignore  the  analog  nature  of  busses.  But 
even  with  the  digital  abstraction,  some  analog  issues  of  busses  may  still  be  noticeable,  such  as 
effects  of  signal  reflections,  transient  glitches,  and  analog  noise.  To  overcome  these  issues,  most 
digital busses are slowed down until they work properly. As a result, digital communication
over busses tends to be slower than communication over direct channels. These penalties
can  be  minimized  by  careful  engineering  of  the  electrical  bus  in  its  intended  environment. 

5.2  Communication  architectures 

Many schemes have been suggested as the interconnection infrastructure for supporting various
communication patterns in digital systems, including point-to-point wires, multistage
interconnection networks, and bussed interconnections. In Chapter 2, we investigated how busses
(multiple-pin wires) can be employed to efficiently realize certain classes of permutations among
modules in a digital system. We demonstrated that by connecting modules with bussed
interconnections, as opposed to point-to-point wires, the number of pins per module can often be
significantly reduced.

Our bussed approach to realizing permutations compares favorably with both the point-to-point
and the multistage-interconnect approaches. Bussed permutation architectures realize
general classes of permutations in one clock cycle, exhibit a small number of pins per module,
and use virtually no switching hardware. Point-to-point architectures, for comparison, can
support any communication pattern in one clock cycle and use no switching hardware, but use
many pins per module. Multistage interconnection architectures, as another alternative, realize
general  classes  of  permutations,  exhibit  a  constant  number  of  pins  per  module,  but  operate  in 
multiple  clock  cycles  and  use  a  considerable  amount  of  switching  hardware.  We  conclude  that 
bussed  interconnections  constitute  an  attractive  alternative  as  a  communication  architecture. 
It  would  be  interesting  to  study  other  classes  of  communication  patterns  that  can  be  efficiently 
implemented  on  bussed  interconnections. 

Several  theoretical  studies  of  systems  with  bussed  interconnections  use  hypergraphs  to  model 
such  systems.  The  topology  of  a  system  with  bussed  interconnections  can  be  modeled  as  a 
hypergraph,  much  as  the  topology  of  a  system  with  point-to-point  wires  can  be  modeled  as  a 
graph.  (See  [9]  for  definitions  and  basic  properties  of  graphs  and  hypergraphs.)  In  systems 
with  bussed  interconnections,  system  modules  are  modeled  as  hypergraph  nodes  and  the  busses 
(multiple-pin  wires)  are  modeled  as  hyperedges.  This  analogy  enables  many  graph-theoretic 
results  to  be  interpreted  in  the  domain  of  architectural  design,  as  was  done  for  instance  in 
[10,  11,  13,  30,  48,  49,  64,  73,  77].  We  believe  that  more  research  in  this  direction  would  be 
fruitful. 
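
As a small illustration of the hypergraph view, the following Python sketch treats each bus as a hyperedge (a set of modules) and counts pins per module as the hypergraph degree of each node. The function name `pins_per_module` and the four-module example are assumptions made for illustration; they do not come from the cited studies.

```python
from collections import defaultdict

def pins_per_module(busses):
    """Return {module: pin count} for a system described by its hyperedges.

    `busses` maps a bus name to the set of modules attached to it. Each
    attachment costs the module one pin, so a module's pin count equals
    its degree in the hypergraph.
    """
    pins = defaultdict(int)
    for modules in busses.values():
        for module in modules:
            pins[module] += 1
    return dict(pins)

# Hypothetical 4-module system: one shared bus versus all pairwise wires.
shared = {"b0": {0, 1, 2, 3}}
point_to_point = {f"w{i}{j}": {i, j} for i in range(4) for j in range(i + 1, 4)}

print(pins_per_module(shared))          # every module uses 1 pin
print(pins_per_module(point_to_point))  # every module uses 3 pins
```

A point-to-point wire is simply a hyperedge of size two, so the same representation covers both kinds of interconnection.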

The  problem  of  realizing  permutations  on  uniform  architectures  in  several  clock  cycles 
presents an interesting direction for further exploration. Our research has demonstrated that
cyclic shifts, for example, can be uniformly realized in $t$ clock cycles by uniform architectures
with $O(n^{1/(2t)})$ pins per module. It would be interesting to develop a pin-time tradeoff for general
classes  of  permutations  on  bussed  architectures,  similar  to  the  tradeoff  exhibited  by  multistage 
interconnection networks and point-to-point wires. An advantage of generalized pin-time bussed
interconnections over multistage interconnection networks would be the avoidance of special
switching hardware.
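
The single-cycle case of the cyclic-shift result can be sketched in Python as follows. The sketch assumes the usual definition of a difference cover (every residue modulo n is a difference of two elements of the set) and one plausible wiring rule, in which pin k of chip i is attached to bus (i + d[k]) mod n. The names `is_difference_cover` and `realize_shift`, and the wiring rule itself, are illustrative assumptions and are not quoted from Chapter 2.

```python
def is_difference_cover(d, n):
    """True if every residue mod n is a difference of two elements of d."""
    diffs = {(a - b) % n for a in d for b in d}
    return diffs == set(range(n))

def realize_shift(d, n, s):
    """Route the cyclic shift i -> (i + s) mod n in one step.

    Assumed wiring: pin k of chip i is attached to bus (i + d[k]) % n.
    Returns {bus: (writer, reader)}; raises ValueError if s is not a
    difference of two elements of d.
    """
    for a in range(len(d)):
        for b in range(len(d)):
            if (d[a] - d[b]) % n == s % n:
                routes = {}
                for i in range(n):
                    bus = (i + d[a]) % n       # chip i writes on its pin a
                    reader = (bus - d[b]) % n  # chip whose pin b sits on this bus
                    routes[bus] = (i, reader)
                return routes
    raise ValueError("shift amount is not covered by d")

d, n = [0, 1, 3], 7                  # three pins per chip for seven chips
print(is_difference_cover(d, n))
print(realize_shift(d, n, 2))
```

Because the map from chip i to bus (i + d[a]) mod n is a bijection, each bus carries exactly one writer and one reader for any covered shift, which is why no switching hardware is needed in this sketch.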

5.3  Control  mechanisms 

Numerous  digital  systems  use  busses  for  implementing  many  control  mechanisms.  Busses  are 
useful  media  for  broadcasting  control  signals  and  for  performing  various  systemwide  protocols. 
In  Chapter  3,  we  explored  how  busses  can  be  efficiently  used  for  arbitration.  We  focused  on 
distributed asynchronous priority arbitration schemes and demonstrated that by using
data-dependent analysis, certain popular mechanisms can be significantly improved.

In Chapter 3, we investigated bussed priority arbitration mechanisms under a standard
digital bus model that assumes a time unit of bus-settling delay for a bus to stabilize to a valid
logic value. A more elaborate bus model that takes into account distances between modules and
signal propagation was examined in Chapter 4. In both of these bus models, the superiority of
the  binomial  arbitration  scheme  over  the  binary  arbitration  scheme  was  established.  Analyzing 
these  arbitration  schemes  in  an  analog  model  of  bus  lines,  which  models  various  transient  effects, 
would  probably  be  a  difficult  task.  However,  simulating  the  analog  behavior  of  these  arbitration 
schemes  could  be  a  tractable  goal. 
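
As a starting point for such simulations, the following Python sketch models binary arbitration under a unit bus-settling delay. It assumes the commonly described wired-OR contention rule: a module drives each 1 bit of its priority, and withdraws all lower-order bits once it sees a 1 on a line where its own bit is 0. The function name `binary_arbitration` and the synchronous round structure are simplifying assumptions; the sketch captures none of the analog effects discussed above and does not reproduce the exact model of Chapter 3.

```python
def binary_arbitration(priorities, m):
    """Simulate wired-OR binary arbitration among competing modules.

    priorities: distinct m-bit integers, one per competing module.
    Returns (winner, rounds), where one round stands for one unit of
    bus-settling delay in this simplified synchronous model.
    """
    bus = [0] * m                  # bus[j] is the wired-OR value of line j
    rounds = 0
    while True:
        new_bus = [0] * m
        for p in priorities:
            for j in range(m - 1, -1, -1):   # most significant line first
                if (p >> j) & 1:
                    new_bus[j] = 1           # drive own 1 bit
                elif bus[j]:
                    break                    # outranked on line j: withdraw lower bits
        if new_bus == bus:                   # lines have settled
            break
        bus = new_bus
        rounds += 1
    winner = sum(bit << j for j, bit in enumerate(bus))
    return winner, rounds

print(binary_arbitration([5, 6, 3], 3))
```

In this model the number of rounds depends on which priorities compete, not only on the number of busses, which is the kind of data dependence the analysis in this thesis exploits.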

On a more general note, the domain of data-dependent analysis of digital systems has not
been investigated much in the past. The results of our work demonstrate that a careful analysis
of the delays experienced in existing systems may result in improved performance of such
systems without changing them. A more systematic approach to analyzing data-dependent
delays in digital systems would prove a valuable tool for digital circuit designers.

5.4  Transmission  lines 

In  Chapter  4  we  introduced  and  examined  a  digital  transmission  line  model  for  a  bus.  In  fact, 
transmission  lines  exhibit  analog  behavior,  but  for  the  purposes  of  digital  computation  they 
can  be  modeled  as  digital  devices.  The  transmission  line  model  enables  a  bus  line  to  carry 
multiple  transactions  at  different  locations  simultaneously.  This  feature  of  a  bus  is  utilized  in 
other shared media, such as radio channels and optical communication, but is mostly ignored
in  electrical  busses.  It  would  be  interesting  to  explore  ways  for  using  the  transmission  line 
properties  of  electrical  busses  as  well. 

The  design  of  digital  communication  protocols  over  busses  should  be  a  careful  engineering 
task,  since  high-speed  busses  are  in  effect  analog  transmission  lines.  Many  bus  systems  work 
properly  only  because  the  busses  are  slowed  down  until  their  analog  behavior  can  be  neglected 
and  the  digital  functions  are  correctly  performed.  However,  ignoring  the  analog  nature  of 
busses  results  in  severe  limitations  to  the  performance  of  many  bussed  systems.  It  would  be 
interesting to investigate other models of transmission lines that capture somewhat more of the
analog behavior of this medium.